3. Introduction
By Mr Nisarg Gandhewar
• The increasing use of smartphones in our day-to-day life to capture images creates a need to recognize text from natural images.
• Text in natural scenes appears in almost every phase of our daily life.
• It is an active research topic in computer vision due to its real-world applications, such as driverless cars and industrial automation.
• Recognizing text from natural images is still a difficult task because of a series of grand challenges.
5. Challenges
• Text density
• Structure of text: text on a page is structured, mostly in strict rows, while text in the wild may be sprinkled everywhere, in different rotations.
• Fonts: scene text appears in a wide variety of fonts and styles.
• Artifacts: outdoor pictures are much noisier than output from a controlled scanner.
• Location: some tasks use cropped/centred text, while in others, text may appear at random locations in the image.
7. Challenges
• Complexity and interference of backgrounds
• Imperfect imaging conditions
• Ignorance of some text part (partially occluded or truncated text)
8. Challenges
• Multi-language text
• Robust Reading Competition: it provides a platform that acts as a bridge between the document analysis community and the computer vision community.
9. Review of Literature
• Classic computer vision techniques: CC-based, sliding-window-based, texture-based
• Segmentation-based techniques: PSENet
• General object detection based techniques: SSD, RetinaNet, RCNN, Fast RCNN, Faster RCNN, YOLO
10. Review of Literature

Sr No | Author | Year | Proposed Work | Remark
1 | Wang et al. | 2019 | Applied a CNN model with a sliding-window (SW) scheme to obtain candidate text lines in a given image and thus estimate text locations. | Classic technique
2 | Yao et al. | 2016 | Consider text detection as a semantic segmentation problem; they use an FCN model based on holistically-nested edge detection (HED) to produce global maps containing information on text regions, individual characters, and their relationships. | Segmentation-based technique
3 | Deng et al. | 2018 | Proposed PixelLink, an instance-segmentation-based technique in which text instances are first segregated by linking pixels within the same instance; text bounding boxes are then obtained directly from the segmentation output, without location regression. | Segmentation-based technique
11. Review of Literature

Sr No | Author | Year | Proposed Work | Remark
4 | Wang et al. | 2019 | Presented PSENet to locate text instances of arbitrary shape: it produces kernels of different scales for every text instance and gradually expands the minimal-scale kernel to cover the complete text instance. | Segmentation-based technique
5 | Deng et al. | 2019 | Proposed a technique relying on RetinaNet for arbitrarily oriented text detection, aiming to incorporate the learning mechanism of the two-stage RCNN structure into a one-stage detector. | Object-detection-based technique, one-stage detector
6 | Adarsh et al. | 2020 | Proposed YOLO v3-Tiny, an improved one-stage model based on YOLO that speeds up object detection while guaranteeing the precision of the outcome [2]. | Object-detection-based technique, one-stage detector
12. Review of Literature

Sr No | Author | Year | Proposed Work | Remark
7 | Liu et al. | 2016 | Presented a Single-Shot Detector (SSD)-like architecture used to extract features and perform text/non-text prediction as well as link prediction. | Object-detection-based technique, one-stage detector
8 | Liao et al. | 2017 | Presented TextBoxes, a text detection technique that detects text in a single network, with no post-processing except non-maximum suppression. | Object-detection-based technique
9 | Liao et al. | 2018 | Presented TextBoxes++, an SSD-based approach for multi-oriented scene text detection with both high precision and efficiency. | Object-detection-based technique, one-stage detector
13. Identified Research Gap
• There exists a trade-off between speed and precision of results when detecting text in scene images.
• The size of the resulting models is large.
• There is scope to improve accuracy.
14. Objectives
• To detect and recognize text from natural scene images under different conditions.
• To develop an approach exploring multiple real-world datasets.
• To read text of different orientations.
• To explore deep learning frameworks for the detection and recognition of text.
• To improve the speed and accuracy of text detection.
• To reduce the size of the model.
16. Pre-processing Steps
• Image annotation
• Image pre-processing operations:
  Orienting
  Resizing
  Auto-adjust contrast
• Image augmentation:
  Shear with a 15° angle,
  Brightness by 25%, and
  Saturation by 25%.
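The augmentation operations listed above can be sketched in plain Python (a minimal illustration on a nested-list "image"; the actual pipeline presumably uses an image-processing library, and these particular shear and brightness implementations are assumptions):

```python
import math

def adjust_brightness(img, factor=1.25):
    """Scale every pixel by `factor` (here +25%) and clip to [0, 255]."""
    return [[min(255, int(round(p * factor))) for p in row] for row in img]

def shear_x(img, angle_deg=15, fill=0):
    """Horizontal shear: row y is shifted by tan(angle) * y pixels."""
    shear = math.tan(math.radians(angle_deg))
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_x = int(round(x - shear * y))  # inverse map: output -> input
            if 0 <= src_x < w:
                out[y][x] = img[y][src_x]
    return out
```

Saturation adjustment works analogously but in a colour space (e.g. HSV) rather than on raw gray values.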
17. Model Tuning and Hyperparameters
• YOLOv4:
  Backbone: CSPDarknet53
  Neck: Path Aggregation Network (PANet)
  Hyperparameters:
  batch=64, subdivisions=32, width=608, height=608, channels=3,
  momentum=0.949, decay=0.0005, saturation=1.5, exposure=1.5, hue=0.1,
  learning rate=0.00261, maximum batches=4000, and filters=18.
  Here we use only the first 137 layers out of 162.
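The filters=18 setting follows from the standard Darknet/YOLO head formula, filters = (classes + 5) × anchors_per_scale: each anchor predicts 4 box offsets, 1 objectness score, and one score per class. Assuming a single "text" class and the usual 3 anchors per detection scale:

```python
def yolo_head_filters(num_classes, anchors_per_scale=3):
    # Each anchor predicts 4 box offsets + 1 objectness score + class scores.
    return (num_classes + 5) * anchors_per_scale

print(yolo_head_filters(1))   # 18 -> one "text" class, matching the config above
print(yolo_head_filters(80))  # 255 -> the COCO default, for comparison
```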
• YOLOv5:
  Backbone: Cross Stage Partial Networks
  Neck: Path Aggregation Network (PANet)
  Hyperparameters:
  batch=16, subdivisions=32, width=416, height=416, momentum=0.1,
  learning rate=0.00261, maximum batches=4000.
  Here we use YOLOv5x pre-trained weights.
18. Model Tuning and Hyperparameters
• Detectron2:
  Backbone: Base-RCNN-FPN
  Neck: Region Proposal Network
  Hyperparameters:
  batch=64, subdivisions=32, width=416, height=416, channels=3,
  momentum=0.1, learning rate=0.001, maximum batches=4000.
  Here we use pre-trained weights with the X101-FPN model.
19. Text Detection Algorithm

Algorithm 1: Text Detection
Input: Image (I)
Output: Detected text regions
For n = 1 to N do
    Divide I into a G × G grid
    For each grid cell do
        Generate V bounding boxes and anchor boxes
        For each V, compute a confidence score (C) and class probability (P)
            IoU = Area of Intersection / Area of Union
            If IoU > 0.5 then
                keep V and apply NMS(P)
            else
                ignore V
        End
    End
End
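The IoU test and the NMS step in the algorithm above can be sketched in plain Python (a minimal illustration with boxes as (x1, y1, x2, y2) tuples; not the detection framework's actual implementation):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box; drop boxes overlapping it with IoU > thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

For example, two near-identical boxes with scores 0.9 and 0.8 collapse to the higher-scoring one, while a distant third box survives.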
21. Data Preprocessing
• Read the image and convert it into a gray-scale image.
• Bring all images to size (128, 32) by using padding.
• Expand the image dimensions to (128, 32, 1) to make them compatible with the input shape of the architecture.
• Normalize the image pixel values by dividing them by 255.
To preprocess the output labels:
• Read the text from the name of the image file.
• Encode each character of a word into a numerical value using a mapping function (e.g. 'a': 0, 'b': 1, ..., 'z': 25).
Say we have the word 'abab'; its encoded label would be [0, 1, 0, 1].
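The character-encoding step can be sketched as follows (a minimal illustration assuming a lowercase-letters-only character set; a real pipeline may also include digits and other symbols):

```python
import string

# Assumed character set: lowercase letters only, 'a' -> 0 ... 'z' -> 25.
CHARS = string.ascii_lowercase
CHAR_TO_ID = {c: i for i, c in enumerate(CHARS)}

def encode_label(word):
    """Map each character of `word` to its numeric id."""
    return [CHAR_TO_ID[c] for c in word]

print(encode_label("abab"))  # [0, 1, 0, 1]
```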
22. Network Architecture
• Input image of height 32 and width 128.
• Seven convolution layers are used, six with kernel size (3, 3) and the last with size (2, 2). The number of filters increases from 64 to 512 layer by layer.
• Two max-pooling layers of size (2, 2) are added, followed by two max-pooling layers of size (2, 1) to extract features with a larger width, which helps predict long texts.
• Batch normalization layers are used after the fifth and sixth convolution layers, which accelerates the training process.
• Two bidirectional LSTM layers are then used, each with 128 units.
Loss function: CTC
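Since the network is trained with CTC loss, its per-time-step outputs are typically decoded greedily: take the argmax class at each time step, collapse consecutive repeats, and drop the blank token. A minimal sketch (the class ids 0–25 for 'a'–'z' and blank index 26 are assumptions matching the encoding above):

```python
BLANK = 26  # assumed blank index, placed after 'a'..'z'

def ctc_greedy_decode(argmax_ids):
    """Best-path CTC decoding: collapse repeated ids, then drop blanks."""
    out = []
    prev = None
    for i in argmax_ids:
        if i != prev and i != BLANK:
            out.append(i)
        prev = i
    return "".join(chr(ord('a') + i) for i in out)

print(ctc_greedy_decode([2, 2, BLANK, 0, 0, 19, BLANK]))  # "cat"
```

Note how the blank separates genuine repeats: [0, BLANK, 0] decodes to "aa", whereas [0, 0] collapses to "a".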
33. Conclusion
• Several techniques exist, with trade-offs in speed, performance, and accuracy of the outcome when identifying text in scene images.
• We have introduced a YOLOv4-, YOLOv5-, and Detectron2-based framework for text detection, designed around the shortcomings of existing techniques.
• The performance of the proposed framework is validated on datasets such as ICDAR 2015, ICDAR 2013, ICDAR 2003, SVT, and MSRA-TD500, using metrics such as precision, recall, and F-measure.
34. Conclusion
• Our YOLOv4-based framework shows promising results compared with existing text detection techniques over several benchmark datasets, except ICDAR 2015.
• Our proposed model has overcome challenges such as:
  • diversity and variability of text in natural scenes,
  • complexity and interference of backgrounds,
  • imperfect imaging conditions,
  • diversity within datasets,
  • ignorance of some text parts,
  • multi-orientation,
  and achieves optimal results on the ICDAR 2013 dataset.
• The text recognition framework also attains very good results.
35. Future Scope
• To extend the proposed framework to detect curved text, with multilingual support.
• To detect text in real-time video as well.
• To secure our model against different adversarial attacks.
36. References
Adarsh P, Rathi P (2020), "YOLO v3-Tiny: Object Detection and Recognition using one-stage improved model", ICACCS.
Bochkovskiy A, Wang CY, Liao HYM (2020), "YOLOv4: Optimal Speed and Accuracy of Object Detection", arXiv:2004.10934.
Deng L, Gong Y, Lu X, Ma Z, Xie M (2019), "STELA: A Real-time Scene Text Detector with Learned Anchor", arXiv:1909.07549.
Deng D, Liu H, Li X, Cai D (2018), "PixelLink: Detecting scene text via instance segmentation", In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 1–8.
Dai J, Qi H, Xiong Y, Li Y, Zhang G (2017), "Deformable convolutional networks", In: IEEE International Conference on Computer Vision, pp 764–773.
Gupta A, Vedaldi A, Zisserman A (2016), "Synthetic data for text localisation in natural images", In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2315–2324.
He T, Huang W, Qiao Y, Yao J (2016), "Accurate text localization in natural image with cascaded convolutional text network", pp 1–10, arXiv:1603.09423.
37. References
He W, Zhang XY, Yin F, Liu CL (2017), "Deep direct regression for multi-oriented scene text detection", In: IEEE International Conference on Computer Vision, pp 745–753.
Howe N (2011), "A Laplacian Energy for Document Binarization", In: International Conference on Document Analysis and Recognition.
Huang Z, Zhong Z, Huo Q (2019), "Mask R-CNN with Pyramid Attention Network for Scene Text Detection", In: IEEE Winter Conference on Applications of Computer Vision.
Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2014), "Reading text in the wild with convolutional neural networks", arXiv:1412.1842.
Jiang Y, Zhu X, Wang X, Yang S, Li W (2017), "R2CNN: Rotational region CNN for orientation robust scene text detection", pp 1–8, arXiv:1706.09579.
Jocher G, Stoken A, Borovec J, Laughing C, Hogan A, Wang A, Diaconu L, Poznanski J, Rai P, Ferriday R, Sullivan T, Xinyu W (2020), "Ultralytics/YOLOv5: v3.0 [Online]", Available: https://github.com/ultralytics/yolov5, doi:10.5281/zenodo.3983579.
Kasar T, Kumar J (2007), "Font and Background Color Independent Text Binarization", In: International Workshop on Camera-Based Document Analysis and Recognition, pp 3–9.
Kittler J, Illingworth J, Föglein J (1985), "Threshold Selection based on a Simple Image Statistic", Computer Vision, Graphics, and Image Processing, pp 125–147.
38. References
Liao M, Shi B, Bai X (2018), "TextBoxes++: A Single-Shot Oriented Scene Text Detector", arXiv:1801.02765.
Liao M, Wan Z, Yao C (2020), "Real-time Scene Text Detection with Differentiable Binarization", In: AAAI Conference on Artificial Intelligence.
Liao M, Shi B, Bai X, Wang X, Liu W (2017), "TextBoxes: A fast text detector with a single deep neural network", In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 1–7.
Liao M, Zhu Z, Shi B, Xia G, Bai X (2018), "Rotation-sensitive regression for oriented scene text detection", In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–10.
Lin H, Yang P (2019), "Review of Scene Text Detection and Recognition", Archives of Computational Methods in Engineering.
Liu X, Meng G, Pan C (2019), "Scene text detection and recognition with advances in deep learning: a survey", IJDAR.
Liu Y, Zhang L, Luo C, Zhang S (2019), "Curved scene text detection via transverse and longitudinal sequence connection", Pattern Recognition.
Liu Y, Jin L (2017), "Deep matching prior network: toward tighter multi-oriented text detection", In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3454–3461.
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S (2016), "SSD: Single shot multibox detector", In: European Conference on Computer Vision, pp 21–37.
Li X, Wang W, Hou W, Liu RZ, Lu T (2018), "Shape robust text detection with progressive scale expansion network", pp 1–12, arXiv:1806.02559.
Lyu P, Yao C, Wu W, Yan S, Bai X (2018), "Multi-oriented scene text detection via corner localization and region segmentation", In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ma J, Shao W, Ye H, Wang L, Wang H (2017), "Arbitrary-oriented scene text detection via rotation proposals", IEEE Transactions on Multimedia, 20:1–9.
39. References
Qin S, Manduchi R (2017), "Cascaded segmentation-detection networks for word-level text spotting", In: International Conference on Document Analysis and Recognition, pp 1275–1282.
Rong X, Yi C, Tian Y (2017), "Unambiguous text localization and retrieval for cluttered scenes", In: CVPR, pp 3279–3287.
Sauvola J, Pietikäinen M (2000), "Adaptive document image binarization", Pattern Recognition, 33(2).
Shi B, Bai X, Yao C (2016), "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence.
Shi B, Wang X, Lyu P, Yao C, Bai X (2016), "Robust scene text recognition with automatic rectification", In: IEEE Conference on Computer Vision and Pattern Recognition.
Shi B, Bai X, Belongie S (2017), "Detecting oriented text in natural images by linking segments", In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2482–3490.
Tian Z, Shu M, Lyu P, Li R, Zhou C (2019), "Learning Shape-Aware Embedding for Scene Text Detection", In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tian Z, Huang W, He T, He P, Qiao Y (2016), "Detecting text in natural image with connectionist text proposal network", In: European Conference on Computer Vision, pp 56–72.
Wang W, Xie E, Zang Y, Lu T (2019), "Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network", In: ICCV.
40. References
Wu Y, Kirillov A, Massa F, Lo W, Girshick R (2019), "Detectron2", https://github.com/facebookresearch/detectron2.
Xue C, Lu S, Zhan F (2018), "Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping", In: ECCV.
Yang Q, Cheng M, Zhou W, Chen Y, Qiu M (2018), "IncepText: A new inception-text module with deformable PSROI pooling for multi-oriented scene text detection", In: International Joint Conference on Artificial Intelligence, pp 1–7.
Yao C, Bai X, Sang N, Zhou X, Zhou S (2016), "Scene text detection via holistic, multi-channel prediction", pp 1–10, arXiv:1606.09002.
Zhang Z, Shen W, Yao C (2015), "Symmetry-based text line detection in natural scenes", In: CVPR, pp 2558–2567.
Zhang Z, Zhang C, Shen W, Yao C, Liu W (2016), "Multi-oriented text detection with fully convolutional networks", In: Computer Vision and Pattern Recognition, pp 4159–4167.
Zhong Z, Sun L, Huo Q (2017), "Improved localization accuracy by LocNet for Faster R-CNN based text detection", In: International Conference on Document Analysis and Recognition.
Zhong Z, Jin L, Zhang S, Feng Z (2016), "DeepText: A unified framework for text proposal generation and text detection in natural images", pp 1–12, arXiv:1605.07314.
Zhou X, Yao C, Wen H, Wang Y, Zhou S (2017), "EAST: An efficient and accurate scene text detector", In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2642–2651.