3. Introduction
By Mr Nisarg Gandhewar
• The increasing use of smartphones in our day-to-day life to capture images creates a need to recognize text from natural images.
• Text in natural scenes appears in almost every phase of our daily life.
• It is an active research topic in computer vision due to its real-world applications, such as driverless cars and industrial automation.
• Recognizing text from natural images is still a difficult task because of a series of grand challenges.
5. Challenges
• Text density
• Structure of text: text on a page is structured, mostly in strict rows, while text in the wild may be sprinkled everywhere, in different rotations.
• Fonts: scene text appears in a wide variety of fonts and styles.
• Artifacts: outdoor pictures are much noisier than output from a controlled scanner.
• Location: some tasks use cropped/centred text, while in others, text may appear at random locations in the image.
7. Challenges
• Complexity and interference of backgrounds
• Imperfect imaging conditions
• Ignorance of some text part (partially occluded or truncated text)
8. Challenges
• Multi-language text
• Robust Reading Competition: it provides a platform that acts as a bridge between the document analysis community and the computer vision community.
9. Review of Literature
• Classic computer vision techniques: CC-based, sliding-window-based, texture-based
• Segmentation-based techniques: PSENet
• General object detection based techniques: SSD, RetinaNet, RCNN, Fast RCNN, Faster RCNN, YOLO
10. Review of Literature

Sr No | Author | Year | Proposed Work | Remark
1 | Wang et al. | 2019 | Applied a CNN model with a sliding-window (SW) scheme to obtain candidate text lines in a given image and thus estimate text locations. | Classic technique
2 | Yao et al. | 2016 | Consider text detection as a semantic segmentation problem; they use an FCN model based on holistically-nested edge detection (HED) to produce global maps containing information on text regions, individual characters, and their relationships. | Segmentation-based technique
3 | Deng et al. | 2018 | Proposed PixelLink, an instance-segmentation-based technique in which text instances are first segregated by linking pixels within the same instance; text bounding boxes are then obtained directly from the segmentation output, without location regression. | Segmentation-based technique
11. Review of Literature

Sr No | Author | Year | Proposed Work | Remark
4 | Wang et al. | 2019 | Presented PSENet to locate text instances of arbitrary shape: it produces kernels of different scales for every text instance and gradually expands the minimal-scale kernel to cover the complete text instance. | Segmentation-based technique
5 | Deng et al. | 2019 | Proposed a technique relying on RetinaNet for arbitrarily oriented text detection, aiming to incorporate the learning mechanism of the two-stage RCNN structure into a one-stage detector. | Object-detection-based technique, one-stage detector
6 | Adarsh et al. | 2020 | Proposed YOLO v3-Tiny, an improved one-stage model based on YOLO that speeds up object detection while guaranteeing the precision of the outcome [2]. | Object-detection-based technique, one-stage detector
12. Review of Literature

Sr No | Author | Year | Proposed Work | Remark
7 | Liu et al. | 2016 | Presented a Single-Shot Detector (SSD)-like architecture used to extract features and perform text/non-text prediction as well as link prediction. | Object-detection-based technique, one-stage detector
8 | Liao et al. | 2017 | Presented TextBoxes, a text detection technique that detects text in a single network, with no post-processing except non-maximum suppression. | Object-detection-based technique
9 | Liao et al. | 2018 | Presented TextBoxes++, an SSD-based approach for multi-oriented scene text detection with both high precision and efficiency. | Object-detection-based technique, one-stage detector
13. Identified Research Gap
• There exists a trade-off between speed and precision of results when detecting text in scene images.
• The size of the resulting models is large.
• There is scope to improve accuracy.
14. Objectives
• To detect and recognize text from natural scene images under different conditions.
• To develop an approach exploring multiple real-world datasets.
• To read text of different orientations.
• To explore deep learning frameworks for the detection and recognition of text.
• To improve the speed and accuracy of text detection.
• To reduce the size of the model.
16. Pre-processing Steps
• Image annotation
• Image pre-processing operations:
  Orienting
  Resizing
  Auto-adjust contrast
• Image augmentation:
  Shear with a 15° angle,
  Brightness by 25%, and
  Saturation by 25%.
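The augmentation operations listed above can be sketched in plain Python (a minimal illustration on a nested-list "image"; the actual pipeline presumably uses an image-processing library, and these particular shear and brightness implementations are assumptions):

```python
import math

def adjust_brightness(img, factor=1.25):
    """Scale every pixel by `factor` (here +25%) and clip to [0, 255]."""
    return [[min(255, int(round(p * factor))) for p in row] for row in img]

def shear_x(img, angle_deg=15, fill=0):
    """Horizontal shear: row y is shifted by tan(angle) * y pixels."""
    shear = math.tan(math.radians(angle_deg))
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_x = int(round(x - shear * y))  # inverse map: output -> input
            if 0 <= src_x < w:
                out[y][x] = img[y][src_x]
    return out
```

Saturation adjustment works analogously but in a colour space (e.g. HSV) rather than on raw gray values.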
17. Model Tuning and Hyperparameters
• YOLOv4:
  Backbone: CSPDarknet53
  Neck: Path Aggregation Network (PANet)
  Hyperparameters:
  batch=64, subdivisions=32, width=608, height=608, channels=3,
  momentum=0.949, decay=0.0005, saturation=1.5, exposure=1.5, hue=0.1,
  learning rate=0.00261, maximum batches=4000, and filters=18.
  Here we use only the first 137 layers out of 162.
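The filters=18 setting follows from the standard Darknet/YOLO head formula, filters = (classes + 5) × anchors_per_scale: each anchor predicts 4 box offsets, 1 objectness score, and one score per class. Assuming a single "text" class and the usual 3 anchors per detection scale:

```python
def yolo_head_filters(num_classes, anchors_per_scale=3):
    # Each anchor predicts 4 box offsets + 1 objectness score + class scores.
    return (num_classes + 5) * anchors_per_scale

print(yolo_head_filters(1))   # 18 -> one "text" class, matching the config above
print(yolo_head_filters(80))  # 255 -> the COCO default, for comparison
```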
• YOLOv5:
  Backbone: Cross Stage Partial Networks
  Neck: Path Aggregation Network (PANet)
  Hyperparameters:
  batch=16, subdivisions=32, width=416, height=416, momentum=0.1,
  learning rate=0.00261, maximum batches=4000.
  Here we use YOLOv5x pre-trained weights.
18. Model Tuning and Hyperparameters
• Detectron2:
  Backbone: Base-RCNN-FPN
  Neck: Region Proposal Network
  Hyperparameters:
  batch=64, subdivisions=32, width=416, height=416, channels=3,
  momentum=0.1, learning rate=0.001, maximum batches=4000.
  Here we use pre-trained weights with the X101-FPN model.
19. Text Detection Algorithm

Algorithm 1: Text Detection
Input: Image (I)
Output: Detected text regions
For n = 1 to N do
    Divide I into a G × G grid
    For each grid cell do
        Generate V bounding boxes and anchor boxes
        For each V, compute a confidence score (C) and class probability (P)
            IoU = Area of Intersection / Area of Union
            If IoU > 0.5 then
                keep V and apply NMS(P)
            else
                ignore V
        End
    End
End
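The IoU test and the NMS step in the algorithm above can be sketched in plain Python (a minimal illustration with boxes as (x1, y1, x2, y2) tuples; not the detection framework's actual implementation):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box; drop boxes overlapping it with IoU > thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

For example, two near-identical boxes with scores 0.9 and 0.8 collapse to the higher-scoring one, while a distant third box survives.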
21. Data Preprocessing
• Read the image and convert it into a gray-scale image.
• Bring all images to size (128, 32) by using padding.
• Expand the image dimensions to (128, 32, 1) to make them compatible with the input shape of the architecture.
• Normalize the image pixel values by dividing them by 255.
To preprocess the output labels:
• Read the text from the name of the image file.
• Encode each character of a word into a numerical value using a mapping function (e.g. 'a': 0, 'b': 1, ..., 'z': 25).
Say we have the word 'abab'; its encoded label would be [0, 1, 0, 1].
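The character-encoding step can be sketched as follows (a minimal illustration assuming a lowercase-letters-only character set; a real pipeline may also include digits and other symbols):

```python
import string

# Assumed character set: lowercase letters only, 'a' -> 0 ... 'z' -> 25.
CHARS = string.ascii_lowercase
CHAR_TO_ID = {c: i for i, c in enumerate(CHARS)}

def encode_label(word):
    """Map each character of `word` to its numeric id."""
    return [CHAR_TO_ID[c] for c in word]

print(encode_label("abab"))  # [0, 1, 0, 1]
```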
22. Network Architecture
• Input image of height 32 and width 128.
• Seven convolution layers are used, six with kernel size (3, 3) and the last with size (2, 2). The number of filters increases from 64 to 512 layer by layer.
• Two max-pooling layers of size (2, 2) are added, followed by two max-pooling layers of size (2, 1) to extract features with a larger width, which helps predict long texts.
• Batch normalization layers are used after the fifth and sixth convolution layers, which accelerates the training process.
• Two bidirectional LSTM layers are then used, each with 128 units.
Loss function: CTC
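Since the network is trained with CTC loss, its per-time-step outputs are typically decoded greedily: take the argmax class at each time step, collapse consecutive repeats, and drop the blank token. A minimal sketch (the class ids 0–25 for 'a'–'z' and blank index 26 are assumptions matching the encoding above):

```python
BLANK = 26  # assumed blank index, placed after 'a'..'z'

def ctc_greedy_decode(argmax_ids):
    """Best-path CTC decoding: collapse repeated ids, then drop blanks."""
    out = []
    prev = None
    for i in argmax_ids:
        if i != prev and i != BLANK:
            out.append(i)
        prev = i
    return "".join(chr(ord('a') + i) for i in out)

print(ctc_greedy_decode([2, 2, BLANK, 0, 0, 19, BLANK]))  # "cat"
```

Note how the blank separates genuine repeats: [0, BLANK, 0] decodes to "aa", whereas [0, 0] collapses to "a".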
33. Conclusion
• Several techniques exist, with trade-offs in speed, performance, and accuracy of the outcome when identifying text in scene images.
• We have introduced a YOLOv4-, YOLOv5-, and Detectron2-based framework for text detection, designed around the shortcomings of existing techniques.
• The performance of the proposed framework is validated on datasets such as ICDAR 2015, ICDAR 2013, ICDAR 2003, SVT, and MSRA-TD500, using metrics such as precision, recall, and F-measure.
34. Conclusion
• Our YOLOv4-based framework shows promising results compared with existing text detection techniques over several benchmark datasets, except ICDAR 2015.
• Our proposed model has overcome challenges such as:
  • diversity and variability of text in natural scenes,
  • complexity and interference of backgrounds,
  • imperfect imaging conditions,
  • diversity within datasets,
  • ignorance of some text parts,
  • multi-orientation,
  and achieves optimal results on the ICDAR 2013 dataset.
• The text recognition framework also attains very good results.
35. Future Scope
• To extend the proposed framework to detect curved text, with multilingual support.
• To detect text in real-time video as well.
• To secure our model against different adversarial attacks.
36. References
Adarsh P, Rathi P (2020), "YOLO v3-Tiny: Object Detection and Recognition using one-stage improved model", ICACCS.
Bochkovskiy A, Wang CY, Liao HYM (2020), "YOLOv4: Optimal Speed and Accuracy of Object Detection", arXiv:2004.10934.
Deng L, Gong Y, Lu X, Ma Z, Xie M (2019), "STELA: A Real-time Scene Text Detector with Learned Anchor", arXiv:1909.07549.
Deng D, Liu H, Li X, Cai D (2018), "PixelLink: Detecting scene text via instance segmentation", In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 1–8.
Dai J, Qi H, Xiong Y, Li Y, Zhang G (2017), "Deformable convolutional networks", In: IEEE International Conference on Computer Vision, pp 764–773.
Gupta A, Vedaldi A, Zisserman A (2016), "Synthetic data for text localisation in natural images", In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2315–2324.
He T, Huang W, Qiao Y, Yao J (2016), "Accurate text localization in natural image with cascaded convolutional text network", pp 1–10, arXiv:1603.09423.
37. References
He W, Zhang XY, Yin F, Liu CL (2017), "Deep direct regression for multi-oriented scene text detection", In: IEEE International Conference on Computer Vision, pp 745–753.
Howe N (2011), "A Laplacian Energy for Document Binarization", In: International Conference on Document Analysis and Recognition.
Huang Z, Zhong Z, Huo Q (2019), "Mask R-CNN with Pyramid Attention Network for Scene Text Detection", In: IEEE Winter Conference on Applications of Computer Vision.
Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2014), "Reading text in the wild with convolutional neural networks", arXiv:1412.1842.
Jiang Y, Zhu X, Wang X, Yang S, Li W (2017), "R2CNN: Rotational region CNN for orientation robust scene text detection", pp 1–8, arXiv:1706.09579.
Jocher G, Stoken A, Borovec J, Laughing C, Hogan A, Wang A, Diaconu L, Poznanski J, Rai P, Ferriday R, Sullivan T, Xinyu W (2020), "Ultralytics/YOLOv5: v3.0 [Online]", Available: https://github.com/ultralytics/yolov5, doi:10.5281/zenodo.3983579.
Kasar T, Kumar J (2007), "Font and Background Color Independent Text Binarization", In: International Workshop on Camera-Based Document Analysis and Recognition, pp 3–9.
Kittler J, Illingworth J, Föglein J (1985), "Threshold Selection based on a Simple Image Statistic", Computer Vision, Graphics, and Image Processing, pp 125–147.
38. References
Liao M, Shi B, Bai X (2018), "TextBoxes++: A Single-Shot Oriented Scene Text Detector", arXiv:1801.02765.
Liao M, Wan Z, Yao C (2020), "Real-time Scene Text Detection with Differentiable Binarization", In: AAAI Conference on Artificial Intelligence.
Liao M, Shi B, Bai X, Wang X, Liu W (2017), "TextBoxes: A fast text detector with a single deep neural network", In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 1–7.
Liao M, Zhu Z, Shi B, Xia G, Bai X (2018), "Rotation-sensitive regression for oriented scene text detection", In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–10.
Lin H, Yang P (2019), "Review of Scene Text Detection and Recognition", Archives of Computational Methods in Engineering.
Liu X, Meng G, Pan C (2019), "Scene text detection and recognition with advances in deep learning: a survey", IJDAR.
Liu Y, Zhang L, Luo C, Zhang S (2019), "Curved scene text detection via transverse and longitudinal sequence connection", Pattern Recognition.
Liu Y, Jin L (2017), "Deep matching prior network: toward tighter multi-oriented text detection", In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3454–3461.
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S (2016), "SSD: Single shot multibox detector", In: European Conference on Computer Vision, pp 21–37.
Li X, Wang W, Hou W, Liu RZ, Lu T (2018), "Shape robust text detection with progressive scale expansion network", pp 1–12, arXiv:1806.02559.
Lyu P, Yao C, Wu W, Yan S, Bai X (2018), "Multi-oriented scene text detection via corner localization and region segmentation", In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ma J, Shao W, Ye H, Wang L, Wang H (2017), "Arbitrary-oriented scene text detection via rotation proposals", IEEE Transactions on Multimedia, 20:1–9.
39. References
Qin S, Manduchi R (2017), "Cascaded segmentation-detection networks for word-level text spotting", In: International Conference on Document Analysis and Recognition, pp 1275–1282.
Rong X, Yi C, Tian Y (2017), "Unambiguous text localization and retrieval for cluttered scenes", In: CVPR, pp 3279–3287.
Sauvola J, Pietikäinen M (2000), "Adaptive document image binarization", Pattern Recognition, 33(2).
Shi B, Bai X, Yao C (2016), "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence.
Shi B, Wang X, Lyu P, Yao C, Bai X (2016), "Robust scene text recognition with automatic rectification", In: IEEE Conference on Computer Vision and Pattern Recognition.
Shi B, Bai X, Belongie S (2017), "Detecting oriented text in natural images by linking segments", In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2482–3490.
Tian Z, Shu M, Lyu P, Li R, Zhou C (2019), "Learning Shape-Aware Embedding for Scene Text Detection", In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tian Z, Huang W, He T, He P, Qiao Y (2016), "Detecting text in natural image with connectionist text proposal network", In: European Conference on Computer Vision, pp 56–72.
Wang W, Xie E, Zang Y, Lu T (2019), "Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network", In: ICCV.
40. References
Wu Y, Kirillov A, Massa F, Lo W, Girshick R (2019), "Detectron2", https://github.com/facebookresearch/detectron2.
Xue C, Lu S, Zhan F (2018), "Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping", In: ECCV.
Yang Q, Cheng M, Zhou W, Chen Y, Qiu M (2018), "IncepText: A new inception-text module with deformable PSROI pooling for multi-oriented scene text detection", In: International Joint Conference on Artificial Intelligence, pp 1–7.
Yao C, Bai X, Sang N, Zhou X, Zhou S (2016), "Scene text detection via holistic, multi-channel prediction", pp 1–10, arXiv:1606.09002.
Zhang Z, Shen W, Yao C (2015), "Symmetry-based text line detection in natural scenes", In: CVPR, pp 2558–2567.
Zhang Z, Zhang C, Shen W, Yao C, Liu W (2016), "Multi-oriented text detection with fully convolutional networks", In: Computer Vision and Pattern Recognition, pp 4159–4167.
Zhong Z, Sun L, Huo Q (2017), "Improved localization accuracy by LocNet for Faster R-CNN based text detection", In: International Conference on Document Analysis and Recognition.
Zhong Z, Jin L, Zhang S, Feng Z (2016), "DeepText: A unified framework for text proposal generation and text detection in natural images", pp 1–12, arXiv:1605.07314.
Zhou X, Yao C, Wen H, Wang Y, Zhou S (2017), "EAST: An efficient and accurate scene text detector", In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2642–2651.