The Wise Document Translator
AbdelRaouf KESKES
January 26, 2021
Outline
1 Introduction and Problem Setting
2 Our Approach
    Global Perspective
    Text Detection : CRAFT
    Text Recognition : STR
    Text Merging
    Inpainting
3 Further Improvements
4 Conclusion
References
Introduction and Problem Setting
Problem
A project that aims to build a system that translates a document from one
language to another while preserving its entire design (layout, logos, signatures, ...)
Why
• Price : human translators are very expensive and produce simple
translation templates that never match the input layout
• Credibility : a translation that matches the input layout exactly
appears more credible and trustworthy
• Degree of importance : documents differ in importance;
translating a worksheet is not equivalent to translating
a person's criminal record
Figure: Document translation example
Our Approach
Having answered the What? and the Why?, let's answer the How?
How?
We divided the problem into 5 sequential sub-problems :
• 1) Text Detection : localizing word-level text areas
• 2) Text Recognition : recognizing the words
• 3) Text Merging : merging words into statements
• 4) Inpainting : deleting the text areas and filling them in
• 5) Text Translation : translating the text and putting it back into the
document
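The five stages can be read as a simple function composition. The sketch below only illustrates that data flow; every name in it (`TextItem`, `translate_document`, the box format, and the five injected stage callables) is hypothetical and stands in for the real models, and drawing the translated text back onto the image is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[int, int, int, int]


@dataclass
class TextItem:
    box: Box
    text: str


def translate_document(
    image,
    detect: Callable[[object], List[Box]],
    recognize: Callable[[object, Box], str],
    merge: Callable[[List[TextItem]], List[TextItem]],
    inpaint: Callable[[object, List[Box]], object],
    translate: Callable[[str], str],
):
    """Run the five stages in order; each stage is injected as a callable."""
    boxes = detect(image)                                       # 1) text detection
    words = [TextItem(b, recognize(image, b)) for b in boxes]   # 2) text recognition
    lines = merge(words)                                        # 3) text merging
    clean = inpaint(image, [w.box for w in words])              # 4) inpainting
    translated = [TextItem(item.box, translate(item.text))      # 5) text translation
                  for item in lines]
    return clean, translated
```

Passing the stages as arguments keeps the sketch independent of any particular detection, recognition, inpainting, or translation library.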
Text Detection
We used a state-of-the-art text detection model called CRAFT
(Character Region Awareness for Text Detection) [1], chosen for its strong results,
multilingual support, open-source code, documentation, ...
Figure: Text Detection
CRAFT : Idea and Data
Idea
Explore characters and the affinity between them to form a text entity
Data
X are the input images, of shape (N, h, w, 3), and Y are the outputs, of shape (N, h, w, 2)
where :
∗ N is the number of images, and h and w are the image height and width respectively
∗ for each image X_i of shape (h, w, 3), Y_i holds two matrices :
the character score heatmap and the affinity/linkage score heatmap
Figure: CRAFT data sample
CRAFT : Data
Synthetic data : character-level annotations are available
Figure: Synthetic data annotations generation process
Real data : only word-level annotations are available
Figure: Real data annotations generation process
Figure: Zoom on the character splitting and inverse projection steps
CRAFT : Model + Loss
Model : a VGG backbone in a U-Net fashion with skip connections
Figure: Schematic illustration of the model architecture
Loss : weighted pixel-wise MSE
The weight S_c(p) is the pixel-wise confidence; for synthetic data it is set
to 1
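As a reminder of what this loss computes, the CRAFT paper [1] sums, over all pixels p, the confidence S_c(p) times the squared errors of the region score S_r(p) and the affinity score S_a(p) against their (pseudo) ground truths. The following is a minimal pure-Python sketch of that sum on small 2-D lists; the function and argument names are illustrative, not taken from the CRAFT code base, and a real implementation would use tensors and may average rather than sum.

```python
def weighted_mse(pred_region, pred_affinity, gt_region, gt_affinity, confidence):
    """Sum over pixels of S_c(p) * (region error^2 + affinity error^2).

    All arguments are 2-D lists (h x w) of floats with identical shapes.
    """
    total = 0.0
    for y, row in enumerate(confidence):
        for x, s_c in enumerate(row):
            region_err = pred_region[y][x] - gt_region[y][x]
            affinity_err = pred_affinity[y][x] - gt_affinity[y][x]
            total += s_c * (region_err ** 2 + affinity_err ** 2)
    return total
```

With synthetic data every entry of `confidence` would be 1, so the expression reduces to an unweighted sum of squared errors.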
CRAFT : Post Processing
From the two predicted heatmaps S_r and S_a, we generate bounding
boxes for words as follows :
• We build a binary mask M where a pixel p is set to 1 if S_r(p) > th_r
or S_a(p) > th_a, where th_r and th_a are threshold
hyperparameters.
• We apply the CCL (Connected Component Labeling) algorithm to M
• We find the minimum-area rectangle enclosing each component (rotated
rectangles are allowed, since the text may be inclined or rotated)
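The three post-processing steps can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: the threshold defaults are placeholders, 4-connectivity is assumed for the labeling, and the last step returns an axis-aligned box rather than the rotated minimum-area rectangle described above. The function name `word_boxes` is hypothetical.

```python
from collections import deque


def word_boxes(region, affinity, th_r=0.5, th_a=0.5):
    """Return one (x_min, y_min, x_max, y_max) box per connected component."""
    h, w = len(region), len(region[0])
    # Step 1: binary mask, M(p) = 1 if S_r(p) > th_r or S_a(p) > th_a.
    mask = [[region[y][x] > th_r or affinity[y][x] > th_a for x in range(w)]
            for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Step 2: flood-fill one component (4-connectivity is an assumption).
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Step 3 (simplified): axis-aligned box, not a rotated min-area rectangle.
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

In the small test case used to check this sketch, a high affinity score between two character pixels joins them into a single word box, which is the role the affinity heatmap plays in the mask.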
Text Recognition
We used a general state-of-the-art text recognition framework from the same
team (Clova AI), called STR (Scene Text Recognition) [2]
Figure: Text Recognition
STR : Idea and Data
Idea
After exposing the fallacies and inconsistencies arising from STR datasets and from unfair
performance comparisons and benchmarks, the authors proposed a unified four-stage framework.
It leverages previous work and goes further by exploring variants of each stage, both at a
granular level and as module-wise combinations
Figure: Four-stage STR model
Data
X are the input images, of shape (N, h, w, 3), where N is the number of images and h and w are
the image height and width respectively; Y are the ground-truth words.
6 datasets: MJSynth, SynthText, IC13, IC15, IIIT, and SVT.
STR : Model
The four stages are :
Transformation: to handle arbitrarily shaped and curved text, the STR network
applies a Thin-Plate Spline (TPS) transformation that normalizes the input
text into a rectangular shape.
Feature Extraction: the transformed image is mapped to a set of
features relevant to character recognition. The authors carried out
experiments with different backbones, namely ResNet, VGG, and RCNN.
Sequence Modeling: we use BiLSTMs to capture contextual information
(ba? => ba"d", ba"g", ba"t"). However, BiLSTMs are costly in memory
and computation, so this stage can be enabled or disabled as
needed.
Prediction: this stage estimates the output character sequence from the
extracted image features. Two options : CTC or attention
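To make the CTC option of the prediction stage more concrete, here is a minimal sketch of greedy CTC decoding: take the most likely class per frame, collapse consecutive repeats, then drop the blank symbol. The function name, the convention that the blank is index 0, and the toy alphabet in the usage note are assumptions for illustration; this is not the STR repository's decoder.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Pick the best class per frame, collapse repeats, then drop blanks.

    frame_probs: list of per-frame probability lists (one value per class).
    alphabet: sequence mapping class index to character (index `blank` unused).
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for idx in best:
        # A character is emitted only when the class changes and is not blank.
        if idx != previous and idx != blank:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)
```

For example, with the alphabet `"-bad"` (index 0 reserved for the blank), frames whose best classes are `b b - a a - d` decode to the word `bad`; a blank between two identical characters is what lets a doubled letter survive the collapse step.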
Text Merging
Words are merged at the row/sentence level, since this is sufficient for translation
and keeps the main project goal of preserving the document design achievable
Figure: Text line-wise merging
Text Merging : Algorithm
Complexity is polynomial, O(n^3), but in practice (200 words/doc on average)
the merging is extremely fast (<1s on my laptop)
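The algorithm itself was shown as a figure that is not reproduced here, so the following is only a hypothetical sketch of a merging procedure with the same cubic worst case: repeatedly scan all pairs of boxes, merge the first pair that lies on the same line, and restart until no pair can be merged. The overlap rule, the `max_gap` default, and all function names are assumptions, not the author's algorithm.

```python
def same_line(a, b, max_gap):
    """True if two items overlap vertically and are horizontally close."""
    ax0, ay0, ax1, ay1 = a[0]
    bx0, by0, bx1, by1 = b[0]
    vertical_overlap = min(ay1, by1) - max(ay0, by0)
    horizontal_gap = max(ax0, bx0) - min(ax1, bx1)
    return vertical_overlap > 0 and horizontal_gap <= max_gap


def merge_words(words, max_gap=15):
    """words: list of ((x0, y0, x1, y1), text). Returns merged line items."""
    items = list(words)
    merged = True
    while merged:                       # at most n - 1 successful passes
        merged = False
        for i in range(len(items)):     # all-pairs comparison: O(n^2) per pass
            for j in range(i + 1, len(items)):
                if same_line(items[i], items[j], max_gap):
                    # Keep reading order: the leftmost item comes first.
                    left, right = sorted((items[i], items[j]), key=lambda it: it[0][0])
                    box = (min(left[0][0], right[0][0]), min(left[0][1], right[0][1]),
                           max(left[0][2], right[0][2]), max(left[0][3], right[0][3]))
                    items[i] = (box, left[1] + " " + right[1])
                    del items[j]
                    merged = True
                    break
            if merged:
                break
    return items
```

Each successful pass removes one item, and each pass compares up to n^2 pairs, which gives the O(n^3) bound; with about 200 words per document this remains a small amount of work.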
Inpainting
Figure: Inpainting process
DeepFill v2 Idea
We used a state-of-the-art free-form inpainting model, DeepFill v2, developed by the Adobe Research team [3].
We chose a model pretrained on the Places2 dataset, since natural scenes, like documents, contain large
regions that share the same texture distribution and spatial information
Figure: DeepFill v2 state-of-the-art results
Idea
The authors proposed different solutions to generative inpainting problems:
• Gated convolutions : vanilla convolutions treat hole pixels and valid pixels alike, which is
meaningless on an image with a hole. Solution => learnable mask
• Free-form : local and global GANs are designed for rectangular holes. Solution =>
SN-PatchGAN.
Inpainting : DeepFill v2 Model
Inpainting : DeepFill v2 Data and Model
Data
• the input X (for both the generator and the discriminator) has shape (N, h, w, 5), where N is the
number of images and h and w are the image height and width respectively; the input
channels are R, G, B, hole mask, and user-guidance mask (optional)
• the generator outputs images with the holes filled in, of shape (N, h, w, 3)
• the custom discriminator (built with spectral normalization in a
patch-based fashion) outputs a tensor of shape (N, h/32, w/32, 256) storing binary variables (binary
classification : fake or real)
Model Details
• Gated Convolutions : normal convolutions are computed as follows :
$$O_{y,x} = \sum \sum W \cdot I$$
PartialConv was then proposed to take only valid pixels into account, using the following static-mask
formula :
$$O_{y,x} = \begin{cases} \sum \sum W \cdot \left( I \odot \dfrac{M}{\mathrm{sum}(M)} \right), & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$
Model Details (2)
The DeepFill v2 authors then proposed a generalization of PartialConv with a
learnable, dynamic mask, called GatedConv, defined as follows :
$$\mathrm{Gating}_{y,x} = \sum \sum W_g \cdot I$$
$$\mathrm{Feature}_{y,x} = \sum \sum W_f \cdot I$$
$$O_{y,x} = \phi(\mathrm{Feature}_{y,x}) \odot \sigma(\mathrm{Gating}_{y,x})$$
• SN-PatchGAN : a convolutional network (6 convolutions with kernel size 5 and stride 2) is used
as the discriminator, with spectral normalization applied using the default fast
approximation algorithm described in SN-GAN
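The gated convolution formulas can be illustrated with a small single-channel sketch in pure Python. It is a simplified illustration under stated assumptions: no padding, stride 1, one input and one output channel, and ReLU chosen as the activation phi (the formula allows any activation). The function names are hypothetical and this is not the DeepFill v2 implementation.

```python
import math


def conv2d_valid(image, kernel):
    """Plain 2-D cross-correlation without padding (single channel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[y + i][x + j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def gated_conv2d(image, w_feature, w_gating):
    """O = phi(Feature) * sigmoid(Gating), with phi = ReLU in this sketch."""
    feature = conv2d_valid(image, w_feature)   # Feature = sum sum W_f . I
    gating = conv2d_valid(image, w_gating)     # Gating  = sum sum W_g . I
    return [[max(0.0, f) / (1.0 + math.exp(-g))   # ReLU(f) * sigmoid(g)
             for f, g in zip(f_row, g_row)]
            for f_row, g_row in zip(feature, gating)]
```

Because the gate is computed from the input by its own learned kernel, it acts as a soft mask between 0 and 1 at every position, instead of the fixed 0/1 mask rule used by PartialConv.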
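Loss
• Generator loss (hinge) : $L_G = -\mathbb{E}_{z \sim P_z(z)} \left[ D^{sn}(G(z)) \right]$
• Discriminator loss (hinge) :
$L_{D^{sn}} = \mathbb{E}_{x \sim P_{data}(x)} \left[ \mathrm{ReLU}(1 - D^{sn}(x)) \right] + \mathbb{E}_{z \sim P_z(z)} \left[ \mathrm{ReLU}(1 + D^{sn}(G(z))) \right]$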
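The two hinge losses can be checked numerically with a short sketch in which the expectations are replaced by averages over a batch of discriminator outputs. The function names are illustrative, and the flat lists of scores stand in for the patch-wise discriminator output.

```python
def relu(value):
    return max(0.0, value)


def generator_loss(d_fake):
    """L_G = -E[D(G(z))], estimated as a mean over the given scores."""
    return -sum(d_fake) / len(d_fake)


def discriminator_loss(d_real, d_fake):
    """L_D = E[ReLU(1 - D(x))] + E[ReLU(1 + D(G(z)))]."""
    real_term = sum(relu(1.0 - d) for d in d_real) / len(d_real)
    fake_term = sum(relu(1.0 + d) for d in d_fake) / len(d_fake)
    return real_term + fake_term
```

The discriminator is only penalized for real scores below 1 and fake scores above -1, so samples it already classifies with a margin contribute nothing to its loss.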
Further Improvements
Protocol
• Preprocessing : strengthen the preprocessing step by accepting natural photos (not
necessarily scanned documents) and making it robust to text orientation.
• Dataset : build a dataset to use for validation and evaluation and, if necessary,
for further training (for example, text recognition for non-Latin languages).
• Evaluation metrics : subsequently, we need to develop annotations and mathematically
sound metrics to estimate the quality of our predictions.
• Text characteristics : add a text font recognition module that, given a text crop,
predicts font family, font size, bold, italic, text color, underline, ...; it could be
end-to-end, handcrafted, or a hybrid of both.
• Currency converter : if the document contains monetary amounts (for example, bills), we
could offer currency conversion.
• Merging algorithm complexity : exploit the structure of the data to avoid comparing
every box with every other box.
Conclusion
"We are very satisfied with the first version of this exciting project. We
firmly believe that we can bring it to life and add value to
society. We look forward to equipping the system with all the proposed
improvements in upcoming versions."
Thank you for your attention
Any questions?
References
[1] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region
awareness for text detection,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 9365–9374, 2019.
[2] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and
H. Lee, “What is wrong with scene text recognition model
comparisons? Dataset and model analysis,” in International
Conference on Computer Vision (ICCV), 2019.
[3] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free-form
image inpainting with gated convolution,” in Proceedings of the IEEE
International Conference on Computer Vision, pp. 4471–4480, 2019.
24/24

More Related Content

What's hot

Social network-analysis-in-python
Social network-analysis-in-pythonSocial network-analysis-in-python
Social network-analysis-in-python
Joe OntheRocks
 
[IJET-V1I5P13] Authors: Ayesha Shaikh, Antara Kanade,Mabel Fernandes, Shubhan...
[IJET-V1I5P13] Authors: Ayesha Shaikh, Antara Kanade,Mabel Fernandes, Shubhan...[IJET-V1I5P13] Authors: Ayesha Shaikh, Antara Kanade,Mabel Fernandes, Shubhan...
[IJET-V1I5P13] Authors: Ayesha Shaikh, Antara Kanade,Mabel Fernandes, Shubhan...
IJET - International Journal of Engineering and Techniques
 

What's hot (10)

E017263040
E017263040E017263040
E017263040
 
Auto-encoding variational bayes
Auto-encoding variational bayesAuto-encoding variational bayes
Auto-encoding variational bayes
 
Social network-analysis-in-python
Social network-analysis-in-pythonSocial network-analysis-in-python
Social network-analysis-in-python
 
Auto encoding-variational-bayes
Auto encoding-variational-bayesAuto encoding-variational-bayes
Auto encoding-variational-bayes
 
Multimedia digital images
 Multimedia  digital images Multimedia  digital images
Multimedia digital images
 
Image processing with matlab
Image processing with matlabImage processing with matlab
Image processing with matlab
 
Image compression models
Image compression modelsImage compression models
Image compression models
 
[IJET-V1I5P13] Authors: Ayesha Shaikh, Antara Kanade,Mabel Fernandes, Shubhan...
[IJET-V1I5P13] Authors: Ayesha Shaikh, Antara Kanade,Mabel Fernandes, Shubhan...[IJET-V1I5P13] Authors: Ayesha Shaikh, Antara Kanade,Mabel Fernandes, Shubhan...
[IJET-V1I5P13] Authors: Ayesha Shaikh, Antara Kanade,Mabel Fernandes, Shubhan...
 
Computer graphics mcq question bank
Computer graphics mcq question bankComputer graphics mcq question bank
Computer graphics mcq question bank
 
3 d graphics with opengl part 1
3 d graphics with opengl part 13 d graphics with opengl part 1
3 d graphics with opengl part 1
 

Similar to The wise doc_trans presentation

Wise Document Translator Report
Wise Document Translator ReportWise Document Translator Report
Wise Document Translator Report
Raouf KESKES
 
Lecture1_computer vision-2023.pdf
Lecture1_computer vision-2023.pdfLecture1_computer vision-2023.pdf
Lecture1_computer vision-2023.pdf
ssuserff72e4
 
Iaetsd traffic sign recognition for advanced driver
Iaetsd traffic sign recognition for  advanced driverIaetsd traffic sign recognition for  advanced driver
Iaetsd traffic sign recognition for advanced driver
Iaetsd Iaetsd
 
11.secure compressed image transmission using self organizing feature maps
11.secure compressed image transmission using self organizing feature maps11.secure compressed image transmission using self organizing feature maps
11.secure compressed image transmission using self organizing feature maps
Alexander Decker
 
Questions On The Equation For Regression
Questions On The Equation For RegressionQuestions On The Equation For Regression
Questions On The Equation For Regression
Tiffany Sandoval
 

Similar to The wise doc_trans presentation (20)

Log polar coordinates
Log polar coordinatesLog polar coordinates
Log polar coordinates
 
Wise Document Translator Report
Wise Document Translator ReportWise Document Translator Report
Wise Document Translator Report
 
U_N.o.1T: A U-Net exploration, in Depth
U_N.o.1T: A U-Net exploration, in DepthU_N.o.1T: A U-Net exploration, in Depth
U_N.o.1T: A U-Net exploration, in Depth
 
Lecture1_computer vision-2023.pdf
Lecture1_computer vision-2023.pdfLecture1_computer vision-2023.pdf
Lecture1_computer vision-2023.pdf
 
Visual Techniques
Visual TechniquesVisual Techniques
Visual Techniques
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
Presentation vision transformersppt.pptx
Presentation vision transformersppt.pptxPresentation vision transformersppt.pptx
Presentation vision transformersppt.pptx
 
Digital Fabrication Studio.03 _Software @ Aalto Media Factory
Digital Fabrication Studio.03 _Software @ Aalto Media FactoryDigital Fabrication Studio.03 _Software @ Aalto Media Factory
Digital Fabrication Studio.03 _Software @ Aalto Media Factory
 
Efficient document compression using intra frame prediction tecthnique
Efficient document compression using intra frame prediction tecthniqueEfficient document compression using intra frame prediction tecthnique
Efficient document compression using intra frame prediction tecthnique
 
Iaetsd traffic sign recognition for advanced driver
Iaetsd traffic sign recognition for  advanced driverIaetsd traffic sign recognition for  advanced driver
Iaetsd traffic sign recognition for advanced driver
 
Easy edd phd talks 28 oct 2008
Easy edd phd talks 28 oct 2008Easy edd phd talks 28 oct 2008
Easy edd phd talks 28 oct 2008
 
Executing Boolean Queries on an Encrypted Bitmap Index
Executing Boolean Queries on an Encrypted Bitmap IndexExecuting Boolean Queries on an Encrypted Bitmap Index
Executing Boolean Queries on an Encrypted Bitmap Index
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
Introduction to Computer graphics
Introduction to Computer graphicsIntroduction to Computer graphics
Introduction to Computer graphics
 
11.secure compressed image transmission using self organizing feature maps
11.secure compressed image transmission using self organizing feature maps11.secure compressed image transmission using self organizing feature maps
11.secure compressed image transmission using self organizing feature maps
 
Ics2311 l02 Graphics fundamentals
Ics2311 l02 Graphics fundamentalsIcs2311 l02 Graphics fundamentals
Ics2311 l02 Graphics fundamentals
 
Questions On The Equation For Regression
Questions On The Equation For RegressionQuestions On The Equation For Regression
Questions On The Equation For Regression
 
OpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyOpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon Valley
 
Deep Learning for Computer Vision: Attention Models (UPC 2016)
Deep Learning for Computer Vision: Attention Models (UPC 2016)Deep Learning for Computer Vision: Attention Models (UPC 2016)
Deep Learning for Computer Vision: Attention Models (UPC 2016)
 
Computer graphics
Computer graphicsComputer graphics
Computer graphics
 

More from Raouf KESKES

Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAI
Raouf KESKES
 
Multi-label Unbalanced Deezer Streaming Classification Report
Multi-label Unbalanced Deezer Streaming Classification  ReportMulti-label Unbalanced Deezer Streaming Classification  Report
Multi-label Unbalanced Deezer Streaming Classification Report
Raouf KESKES
 
Multi Label Deezer Streaming Classification
Multi Label Deezer Streaming ClassificationMulti Label Deezer Streaming Classification
Multi Label Deezer Streaming Classification
Raouf KESKES
 
Machine Learning Interpretability / Explainability
Machine Learning Interpretability / ExplainabilityMachine Learning Interpretability / Explainability
Machine Learning Interpretability / Explainability
Raouf KESKES
 
Reds interpretability report
Reds interpretability reportReds interpretability report
Reds interpretability report
Raouf KESKES
 

More from Raouf KESKES (8)

Master thesis
Master thesisMaster thesis
Master thesis
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAI
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
Multi-label Unbalanced Deezer Streaming Classification Report
Multi-label Unbalanced Deezer Streaming Classification  ReportMulti-label Unbalanced Deezer Streaming Classification  Report
Multi-label Unbalanced Deezer Streaming Classification Report
 
Multi Label Deezer Streaming Classification
Multi Label Deezer Streaming ClassificationMulti Label Deezer Streaming Classification
Multi Label Deezer Streaming Classification
 
Machine Learning Interpretability / Explainability
Machine Learning Interpretability / ExplainabilityMachine Learning Interpretability / Explainability
Machine Learning Interpretability / Explainability
 
Reds interpretability report
Reds interpretability reportReds interpretability report
Reds interpretability report
 
Reds presentation ml_interpretability_raouf_aurelia
Reds presentation ml_interpretability_raouf_aureliaReds presentation ml_interpretability_raouf_aurelia
Reds presentation ml_interpretability_raouf_aurelia
 

Recently uploaded

Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systems
meharikiros2
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
pritamlangde
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 

Recently uploaded (20)

Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systems
 
Memory Interfacing of 8086 with DMA 8257
Memory Interfacing of 8086 with DMA 8257Memory Interfacing of 8086 with DMA 8257
Memory Interfacing of 8086 with DMA 8257
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata Model
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
Augmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxAugmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptx
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Linux Systems Programming: Inter Process Communication (IPC) using Pipes
Linux Systems Programming: Inter Process Communication (IPC) using PipesLinux Systems Programming: Inter Process Communication (IPC) using Pipes
Linux Systems Programming: Inter Process Communication (IPC) using Pipes
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
 
Introduction to Geographic Information Systems
Introduction to Geographic Information SystemsIntroduction to Geographic Information Systems
Introduction to Geographic Information Systems
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
fitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptfitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .ppt
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 

The wise doc_trans presentation

  • 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator The Wise Document Translator AbdelRaouf KESKES January 26, 2021 1/24
  • 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Outline 1 Introduction and Problem Setting 2 Our Approach Global Perspective Text Detection : CRAFT Text Recognition : STR Text Merging Inpainting 3 Further Improvements 4 Conclusion References 2/24
  • 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Introduction and Problem Setting Introduction and Problem Setting Problem A project where I aim to build a system that converts a document from a language to another keeping all the design (Layout, Logo, Sign, ...) Why • Price : Real translators are very expensive and make simple translation templates that never match the input layout • Credibility : it adds a kind of credibility and trustworthiness when we see that the translation match exactly the input layout • Importance degree : Documents have different degree of importance, translating a worksheet is not equivalent to translating a criminal record of a person 3/24
  • 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Introduction and Problem Setting Introduction and Problem Setting Figure: Document translation example 4/24
  • 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Global Perspective Our Approach After Answering the (What?) and the Why?, Let’s answer the How? ... How? We divided our problem to 5 sequential sub-problems : • 1) Text Detection : Localizing word-wise text areas • 2) Text Recognition : recognizing the words • 3) Text Merging : merging words to create statements • 4) Inpainting : Delete the text areas and fill them • 5) Text Translation : Translate the text and put it back in the documents 5/24
  • 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Text Detection : CRAFT Text Detection We used a SOTA model in Text Detection which is called CRAFT (Character Region Awareness for Text Detection) [1] : best results, multilingual, open source code, documentation, ... Figure: Text Detection 6/24
  • 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Text Detection : CRAFT CRAFT : Idea and Data Idea exploring characters and affinity between them to form a text entity Data X are input images whose shape is (N, h, w, 3), and Y are outputs whose shape is (N, h, w, 2) where : ∗ N is number of images, h is the height and w is the width of the image respectively ∗ for each image Xi of shape (h, w, 3) we have an Yi representing two matrices : characters score heatmap and affinity/linkage score heatmap Figure: CRAFT data sample 7/24
  • 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Text Detection : CRAFT CRAFT : Data Synthetic data : where we have characters level annotations Figure: Synthetic data annotations generation process 8/24
  • 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Text Detection : CRAFT CRAFT : Data Real data : where we have word-level annotations Figure: Real data annotations generation process Figure: Zoom in the splitting characters and inverse projection steps 9/24
  • 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Text Detection : CRAFT CRAFT : Model + Loss VGG in a U-Net fashion with skip connections Figure: Schematic illustration of the model Architecture Loss : Weighted pixel-wise MSE Sc(p) for synthetic data we obviously set it to 1 10/24
  • 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Text Detection : CRAFT CRAFT : Post Processing From the 2 predicted heatmaps Sr and Sa we want to generate bounding boxes for words and it is done as the following : • We build a binary mask M where a pixel p is set to 1 if (Sr(p) > thr) or (Sa(p) > tha) where thr and tha are thresholding hyper parameters. • We apply CCL(Connected Component Labeling) algorithm on M • We find Min Area Rectangle covering each component (rotated rectangles are accepted too since the text could be inclined/rotated ) 11/24
  • 12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Text Recognition : STR Text Recognition We used a generalized SOTA model in Text Recognition from the same team CLOVAAI called STR (Scene Text Recognition) [2] Figure: Text Recognition 12/24
  • 13. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Wise Document Translator Our Approach Text Recognition : STR STR : Idea and Data Idea After showing the fallacies and the inconsistencies raising from STR datasets and unfair perfomances comparison and benchmarks.They proposed a four stages unified framework laveraging previous work and also going beyond by exploring their variants on a granular way and also general way (module wise) combinations Figure: Four stages STR Model Data X are input images whose shape is (N, h, w, 3), where N is number of images, h is the height and w is the width of the image respectively, Y are ground truth words. 6 datasets: MJSynth, SynthText, IC13, IC15, IIIT, and SVT. 13/24
  • 14. STR: Model. The four stages are:
      • Transformation: to handle arbitrarily shaped and curved text, the STR network applies a Thin-Plate Spline (TPS) transformation that normalizes the input text into a rectangular shape.
      • Feature Extraction: the transformed image is mapped to a set of features relevant to character recognition. The authors experimented with different backbones, namely ResNet, VGG, and RCNN.
      • Sequence Modeling: BiLSTMs capture contextual information (ba? => ba"d", ba"g", ba"t"). However, BiLSTMs are costly in memory and computation, so this stage can be enabled or disabled as needed.
      • Prediction: this stage estimates the output character sequence from the extracted image features. Two options: CTC or attention.
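To make the CTC option of the prediction stage more concrete, here is a minimal sketch of the standard greedy (best-path) CTC decoding rule: take the best class per frame, collapse consecutive repeats, then drop blanks. It is not taken from the STR repository; the function name, the `alphabet` layout (blank at index 0), and the list-of-scores input are assumptions made for illustration.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Decode per-frame class scores into a string with the greedy CTC rule.

    frame_scores: list of per-frame score lists, one score per class.
    alphabet: string where alphabet[i] is the character of class i; the entry
              at index `blank` is a placeholder and is never emitted.
    """
    # Best class index for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    chars = []
    previous = None
    for index in best:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != blank:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)
```

A blank between two identical classes keeps both characters, which is how CTC represents doubled letters.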
  • 15. Text Merging. Words are merged at the row/sentence level, since this level is sufficient for translation and keeps the main project goal of preserving the document design achievable. Figure: Line-wise text merging
  • 16. Text Merging: Algorithm. Complexity is polynomial, $O(n^3)$, but in practice (about 200 words per document on average) the merging is extremely fast (< 1 s on my laptop).
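The algorithm itself was shown as an image and is not recoverable from the transcript. The following is therefore only a hypothetical sketch of one way line-wise merging with cubic worst-case cost could look, not the author's method: repeatedly scan all pairs of boxes, merge the first pair that sits on the same line, and restart. The dictionary layout, the 0.5 vertical-overlap rule, and `max_gap_ratio` are all assumptions.

```python
def same_line(a, b, max_gap_ratio=1.0):
    """True if boxes a and b overlap vertically and are horizontally close."""
    ax0, ay0, ax1, ay1 = a["box"]
    bx0, by0, bx1, by1 = b["box"]
    overlap = min(ay1, by1) - max(ay0, by0)
    min_height = min(ay1 - ay0, by1 - by0)
    gap = max(ax0, bx0) - min(ax1, bx1)  # negative when boxes overlap horizontally
    return overlap > 0.5 * min_height and gap <= max_gap_ratio * min_height

def merge_pair(a, b):
    """Join two boxes into one, reading the texts left to right."""
    left, right = sorted((a, b), key=lambda word: word["box"][0])
    lx0, ly0, lx1, ly1 = left["box"]
    rx0, ry0, rx1, ry1 = right["box"]
    return {
        "text": left["text"] + " " + right["text"],
        "box": (min(lx0, rx0), min(ly0, ry0), max(lx1, rx1), max(ly1, ry1)),
    }

def merge_words(words, max_gap_ratio=1.0):
    """Merge word boxes into line boxes; each pass is O(n^2), with up to n passes."""
    lines = list(words)
    merged = True
    while merged:
        merged = False
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                if same_line(lines[i], lines[j], max_gap_ratio):
                    combined = merge_pair(lines[i], lines[j])
                    lines = [w for k, w in enumerate(lines) if k not in (i, j)]
                    lines.append(combined)
                    merged = True
                    break
            if merged:
                break
    return lines
```

Boxes are `(x_min, y_min, x_max, y_max)` tuples. Comparing every box with every other box on each pass is what makes the worst case cubic.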
  • 17. Inpainting. Figure: Inpainting process
  • 18. DeepFill v2.
      Idea: We used a state-of-the-art free-form inpainting model developed by the Adobe Research team [3]. We chose a model pretrained on the Places2 dataset, since natural places, like documents, have large parts sharing the same texture distribution and spatial information. Figure: DeepFill v2 SOTA results
      Idea: the authors proposed solutions to two generative inpainting problems:
      • Gated convolutions: applying vanilla convolutions to an image with a hole is not meaningful. Solution => learnable mask.
      • Free-form holes: local and global GANs are designed for rectangular holes. Solution => SN-PatchGAN.
  • 19. Inpainting: DeepFill v2 Model
  • 20. Inpainting: DeepFill v2 Data and Model.
      Data
      • The input X (for both the generator and the discriminator) has shape (N, h, w, 5), where N is the number of images and h and w are the image height and width. The input channels are R, G, B, holes mask, and user-guidance mask (optional).
      • The generator outputs hole-filled images of shape (N, h, w, 3).
      • The custom discriminator (built with spectral normalization in a patch-wise fashion) outputs a tensor of shape (N, h/32, w/32, 256) storing per-patch binary (fake or real) classifications.
      Model Details
      • Gated convolutions: a normal convolution is computed as $O_{y,x} = \sum \sum W \cdot I$. PartialConv was then proposed to take only valid pixels into account, using the following static-mask formula: $O_{y,x} = \begin{cases} \sum \sum W \cdot \left( I \odot \frac{M}{\mathrm{sum}(M)} \right), & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$
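As a small illustration of the difference between the two formulas, the sketch below evaluates a normal convolution and the PartialConv rule at a single output location, following the slide's expression with the mask normalized by sum(M). Function names and the nested-list window representation are assumptions for illustration only.

```python
def plain_conv_at(window, weights):
    """Normal convolution at one location: sum of W * I over the window."""
    return sum(w * i
               for w_row, i_row in zip(weights, window)
               for w, i in zip(w_row, i_row))

def partial_conv_at(window, mask, weights):
    """PartialConv at one location, following the slide's static-mask formula.

    mask holds 1 for valid pixels and 0 for hole pixels. If no pixel in the
    window is valid, the output is 0.
    """
    valid = sum(sum(row) for row in mask)
    if valid == 0:
        return 0.0
    total = 0.0
    for w_row, i_row, m_row in zip(weights, window, mask):
        for w, i, m in zip(w_row, i_row, m_row):
            total += w * (i * m / valid)
    return total
```

Hole pixels contribute nothing to the partial result, whereas the plain convolution mixes them in as if they were valid image content.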
  • 21. Inpainting: DeepFill v2 Model.
      Model Details (2)
      • The DeepFill v2 authors then proposed GatedConv, a generalization of PartialConv with a learnable, dynamic mask: $Gating_{y,x} = \sum \sum W_g \cdot I$, $Feature_{y,x} = \sum \sum W_f \cdot I$, $O_{y,x} = \phi(Feature_{y,x}) \odot \sigma(Gating_{y,x})$
      • SN-PatchGAN: a convolutional network (6 convolutions with kernel size 5 and stride 2) is used as the discriminator, with spectral normalization applied using the default fast approximation algorithm described in SN-GAN.
      Loss
      • Generator loss (hinge): $\mathcal{L}_G = -\mathbb{E}_{z \sim P_z(z)} \left[ D^{sn}(G(z)) \right]$
      • Discriminator loss (hinge): $\mathcal{L}_{D^{sn}} = \mathbb{E}_{x \sim P_{data}(x)} \left[ \mathrm{ReLU}(1 - D^{sn}(x)) \right] + \mathbb{E}_{z \sim P_z(z)} \left[ \mathrm{ReLU}(1 + D^{sn}(G(z))) \right]$
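The gating rule and the two hinge losses can be sketched numerically as follows. This is an illustrative scalar version, not the DeepFill v2 code: the activation $\phi$ is assumed to be ReLU by default, expectations are replaced by batch means over lists of discriminator scores, and all function names are hypothetical.

```python
import math

def relu(value):
    return max(0.0, value)

def gated_output(feature, gating, phi=relu):
    """O = phi(Feature) * sigmoid(Gating) for one output location."""
    gate = 1.0 / (1.0 + math.exp(-gating))  # soft, learnable mask value in (0, 1)
    return phi(feature) * gate

def generator_loss(d_fake):
    """Hinge generator loss: minus the mean discriminator score on generated samples."""
    return -sum(d_fake) / len(d_fake)

def discriminator_loss(d_real, d_fake):
    """Hinge discriminator loss: mean ReLU(1 - D(x)) + mean ReLU(1 + D(G(z)))."""
    real_term = sum(relu(1.0 - score) for score in d_real) / len(d_real)
    fake_term = sum(relu(1.0 + score) for score in d_fake) / len(d_fake)
    return real_term + fake_term
```

A gate near 0 suppresses a feature (for example inside a hole) and a gate near 1 lets it through. The hinge terms stop penalizing real scores above 1 and fake scores below -1.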
  • 22. Further Improvements. Protocol:
      • Preprocessing: strengthen the preprocessing step so that it accepts natural photos (not necessarily scans) and is robust to text orientation.
      • Dataset: build a dataset for validation, evaluation, and, if necessary, further training (for example, text recognition for non-Latin languages).
      • Evaluation metrics: subsequently, develop annotations and mathematically sound metrics to estimate the quality of our predictions.
      • Text characteristics: add a text font recognition module that, given a text crop, predicts font family, font size, bold, italic, text color, underline, and so on. It could be end-to-end, handcrafted, or a hybrid of both.
      • Currency converter: if the document contains monetary amounts (for example, bills), we could offer currency conversion.
      • Merging algorithm complexity: exploit the nature of the data to avoid comparing every box with every other box.
  • 23. Conclusion. "We are really satisfied with the first version of this enticing project. We deeply believe that we can bring this project to life and add value to society. We look forward to endowing the system with all the proposed improvements in the next versions."
  • 24. Thank you for your attention. Any questions?
  • 25. References
      [1] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, "Character region awareness for text detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9365–9374, 2019.
      [2] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, "What is wrong with scene text recognition model comparisons? Dataset and model analysis," in International Conference on Computer Vision (ICCV), 2019.
      [3] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Free-form image inpainting with gated convolution," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4471–4480, 2019.