IET Image Processing
Research Article
Detected text-based image retrieval approach
for textual images
ISSN 1751-9659
Received on 9th April 2018
Revised 22nd November 2018
Accepted on 3rd December 2018
E-First on 31st January 2019
doi: 10.1049/iet-ipr.2018.5277
www.ietdl.org
Salahuddin Unar1, Xingyuan Wang1,2, Chuan Zhang1, Chunpeng Wang3
1 Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, People's Republic of China
2 School of Information Science and Technology, Dalian Maritime University, Dalian 116026, People's Republic of China
3 School of Information, Qilu University of Technology, Shandong 250353, People's Republic of China
E-mail: wangxy@dlut.edu.cn
Abstract: This work addresses the problem of searching and retrieving similar textual images based on the detected text and opens new directions for textual image retrieval. For image retrieval, several methods have been proposed to extract visual features and social tags; however, extracting embedded and scene text within images and using that text as automatic keywords/tags is still a young research field for text-based and content-based image retrieval applications. Automatic text detection and retrieval is an emerging technology for robotics and artificial intelligence. In this study, the authors propose a novel approach to detect the text in an image and exploit it as keywords and tags for automatic text-based image retrieval. First, text regions are detected using the maximally stable extremal region algorithm. Second, unwanted false-positive text regions are eliminated based on geometric properties and the stroke width transform. Next, the true text regions are passed to optical character recognition. Third, keywords are formed using a neural probabilistic language model. Finally, the textual images are indexed and retrieved based on the detected keywords. The experimental results on two benchmark datasets show that the detected text is an efficient and valuable cue for image retrieval, specifically for textual images.
1 Introduction
With recent advances in information technology and digital media, capturing and sharing information (i.e. images, video, and audio) has increased significantly. Efficient methods are therefore needed to retrieve such information, which now exists in excessive amounts. For this purpose, content-based image retrieval (CBIR) has acted as a backbone of the multimedia and computer vision communities for the last two decades [1–3]. In CBIR, for a given query image, the system retrieves a number of similar images from the database and presents them to the user. The resulting images can be similar to the query image in the sense of colour, shape, and texture of objects within the image under varying conditions and complex backgrounds.
Image retrieval is a wide research field that draws on methods from information retrieval, machine learning, multimedia research, computer vision, and human–computer interaction. Image retrieval methods can be classified into two groups: text-based image retrieval (TBIR) and CBIR. TBIR methods need some heuristic information in textual form (i.e. image descriptions and tags) for each image; indexing and retrieval are then performed through textual queries. Such methods are sufficient for a limited number of database images with precise tags and descriptions. However, their limitation is that they require a huge amount of human labour to manually annotate each image. Nowadays, images exist in the millions, and it is almost impossible to annotate each image manually. To overcome such limitations, CBIR methods have been introduced. CBIR methods describe images by their visual contents (i.e. colour, shape, and texture) and depend heavily on analysing image descriptors and similarity measurement.
For a robust CBIR system, the main purpose is to achieve higher accuracy with minimum computation time. To boost retrieval performance, several methods have been introduced to retrieve similar images from the database [4–6]. However, researchers have not yet standardised any ideal approach, and it remains a challenging problem. Owing to the growth of image data, simple features such as colour, shape, and texture are no longer sufficient to interpret the image efficiently. Existing methods have mostly focused on extracting visual features such as colour, texture, and shape, and on fusing multiple visual descriptors [7–9]. Indexing of similar images is achieved based on these visual features. However, no standard method has yet been proposed for retrieving textual images.
With the increasing usage of social media sites (e.g. Instagram, Flickr, and Facebook), millions of people share their pictures every day. Many of these pictures contain textual information, which is an additional and clearer clue to perceiving the image. It is a common practice to edit pictures by writing inspirational and motivational quotes on them, as shown in Fig. 1b. Sometimes, pictures captured in natural scene environments also contain textual data against complex backgrounds, as shown in Fig. 1a. Therefore, the text embedded in images can provide useful information for automatic tagging, annotation, and indexing. Exploiting such information makes it possible to retrieve textual images similar to the query image. Consequently, to improve the retrieval accuracy of TBIR and CBIR for textual images, the detected text can be an enormous asset for perceiving the image more deeply.
The automatic extraction of textual contents is a challenging yet valuable task for several computer vision-based applications, for example, helping a blind person read the contents of an image or helping a tourist translate them. Retrieving textual contents can also be greatly useful for robots to perform specific actions.
In recent years, the problem of text detection and localisation in images has gained much attention [10–14]. Several methods have been proposed to detect text in images. However, their core objective is to detect and localise the text only; they do not consider the detected text for retrieving similar images. Some of the well-known methods for text detection are summarised in Table 1.
Most state-of-the-art CBIR methods explore visual features. Sometimes the visual features are fused together to achieve higher accuracy, and image indexing is performed based on these visual features. In [7], Liu et al. proposed a colour information feature (CIF) and combined it with a local binary pattern (LBP)-based feature, since LBP-based features are sometimes poor at capturing rich colour information. CIF is capable of describing colour
information, image brightness, and colour distribution. Walia and Pal [8] proposed a fusion framework that combines low-level features by employing the colour difference histogram and the angular radial transform. Yang et al. [22] presented an approach based on a salient point detector and salient point expansion using local visual features; the salient points are obtained with the speeded-up robust features detector. To cope with large visual vocabularies, Wang et al. [23] proposed a hierarchy of medium-sized vocabularies, with sparse representation adapted to select a specific vocabulary. In [24], Dimitrovski et al. employed predictive clustering trees to construct an indexing structure for codebook construction; such codebooks can efficiently increase the discriminative power of the dictionary. So far, these methods have been developed for visual images (i.e. images containing colourful objects) and cannot perform well for textual images.
Moreover, several methods have been proposed to retrieve images based on social tags and keywords. In [25], Li et al. proposed a model that extracts visual, contextual, and semantic features to identify objects and predict the importance of scene tags. However, these tags are sometimes inaccurate and reflect only emotion. Wu et al. [26] introduced a new method for incomplete and missing social tags, using a tag matrix for the image–tag relation that exploits observed tags and visual similarity. Liu et al. [27] proposed a novel approach for improving improper social tags; their approach enforces the consistency of visual and semantic similarities along with the social tags before and after improvement. Some authors introduced hybrid visual–textual relevance learning methods. Cui et al. [28] proposed a method based on textual–visual relevance learning that extracts text from image tags and associates it with visual features. So far, these methods have used visual features and social tags for indexing and retrieving similar images, and no standard method has yet been proposed to detect text automatically and retrieve images based on the detected text.
In this paper, we propose a novel approach to retrieve similar textual images by detecting embedded and scene text in textual images. In particular, we use a text detection technique and employ the detected text as keywords and tags for indexing and retrieving the textual images. The key contributions of this work are as follows:
• To the best of our knowledge, there has been no standard method for textual image retrieval. This work is one of the few investigations on indexing and retrieving textual images effectively.
• The proposed method is innovative in dealing with textual images that contain text within the image (e.g. quote images, scene images, individual video frames).
• A fully automatic TBIR method is proposed to retrieve similar textual images based on the text detected in complex background images. The detected text is employed as keywords/tags for indexing and retrieval.
• The method is robust and efficient in retrieving similar textual images for a given textual image query, and is based on an easy-to-use framework.
• A new dataset of 1000 images covering 20 categories is introduced. The dataset includes quote images, natural scene images, Twitter snapshots, TV channel video frames, and other textual images.
The rest of this paper is organised as follows. Section 2 describes the proposed method. Section 3 presents the different similarity distance measurements. In Section 4, a new dataset is introduced and the experimental results are evaluated. Section 5 concludes the paper and outlines future directions.
2 Proposed method
In this section, we introduce a novel approach that detects text and employs that text to index and retrieve similar textual images. First, candidate text regions are detected using the maximally stable extremal region (MSER) algorithm. After applying MSER, several non-text regions may still exist; to remove them, we apply some geometric properties. Further filtration of non-text regions is carried out using the stroke width transform (SWT). After obtaining positive text regions, we apply bounding boxes to form text lines from the textual components. Once the text is localised and detected, it is fed into optical character recognition (OCR). A neural probabilistic language model (LM) is employed to form individual keywords from the recognised text. Four distance/similarity measures, namely Euclidean distance, Canberra distance, Manhattan distance, and cosine similarity, are used to compute the similarity between the query image and the database images. Finally, the top-ranked images are retrieved based on the distance computation. A schematic illustration of the proposed approach is shown in Fig. 2.
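The following is a minimal Python sketch of this pipeline (not the authors' implementation), using OpenCV and pytesseract; the helper names filter_geometric, filter_stroke_width, and group_into_words are hypothetical stubs standing in for the stages detailed in Sections 2.2–2.4.

```python
import cv2
import pytesseract

def retrieve_keywords(image_path):
    """Detect text in a textual image and return word-level keywords for indexing."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Step 1: candidate text regions via MSER (Section 2.1).
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)

    # Steps 2-4 (hypothetical helpers): prune non-text regions and group the
    # survivors into word bounding boxes.
    # bboxes = filter_geometric(bboxes)
    # bboxes = filter_stroke_width(gray, bboxes)
    # words  = group_into_words(bboxes)

    # Step 5: recognise text with the Tesseract OCR engine and keep word tokens
    # as keywords/tags (Section 2.5 additionally scores them with a language model).
    text = pytesseract.image_to_string(gray)
    return [w.lower() for w in text.split() if any(c.isalnum() for c in w)]

keywords = retrieve_keywords("query_image.jpg")   # hypothetical query image
print(keywords)
```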
Fig. 1  Sample textual images from datasets
(a) ICDAR 2003 dataset, (b) Sindh dataset
Table 1 State-of-the-art methods for text detection and localisation in natural scene images
Method | Precision | Recall | F value | Features | Determination | Datasets
Ezaki et al. [15] | 60 | 64 | 62 | connected component based | text detection in natural scene images | ICDAR 2003
Zhou et al. [16] | 37 | 88 | 53 | texture based | text localisation and classification | ICDAR 2003
Zhang and Kasturi [17] | 67 | 46 | — | edge based | text edge detection and extraction | ICDAR 2003
Epshtein et al. [18] | 73 | 60 | 66 | stroke based | SWTs | ICDAR 2003, ICDAR 2005
Ma et al. [19] | 67 | 72 | — | edge and CC based | component analysis and edge detection | ICDAR 2003
Neumann and Matas [20] | 59 | 55 | 57 | texture and edge based | text localisation using MSER | ICDAR 2003, Chars75K
Yi and Tian [21] | 71 | 62 | 62 | connected components based | text detection in natural scene images | ICDAR 2003, OSTD
ICDAR: International Conference on Document Analysis and Recognition; OSTD: oriented scene text dataset
2.1 Character candidate extraction
MSER has been identified as one of the best region detectors owing to its robustness against scale, viewpoint, and lighting changes. Several methods have adapted MSER to extract character candidates and achieved satisfactory results [29–31]. The main advantage of the MSER algorithm over other traditional methods is that it can detect most of the textual components even in low-quality images. Generally, text has distinct contrast and appearance against its complex background and a comparatively uniform colour intensity; hence, MSER is a suitable choice. The proposed method employs MSER to extract character candidate regions [32].
Let $p_1, p_2, p_3, \ldots, p_i$ be a sequence of nested extremal regions, that is, $p_i \subset p_{i+1}$. The stability of region $p_i$ is measured as

$$v(i) = \frac{|p_{i+\Delta}| - |p_i|}{|p_i|} \qquad (1)$$

where $\Delta$ is a parameter. If $v(i)$ has a local minimum at $i$, then $p_i$ is an MSER. The text regions obtained after applying the MSER filter are shown in Fig. 3b.
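A minimal OpenCV sketch of this step is shown below (assumed implementation, not the authors' code); the delta parameter plays the role of Δ in (1) and the input file name is hypothetical.

```python
import cv2

img = cv2.imread("quote_image.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input file
mser = cv2.MSER_create()
mser.setDelta(5)                      # stability parameter Δ
regions, bboxes = mser.detectRegions(img)

# Each entry in 'regions' is an array of pixel coordinates of one extremal region;
# 'bboxes' gives the corresponding (x, y, w, h) boxes used in the later filtering stages.
print(f"{len(regions)} candidate regions detected")
```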
2.2 Non-text objects filtering
After applying MSER, many non-text objects may still exist. We apply simple geometric properties such as width, height, and aspect ratio to filter out obvious non-text objects. The objects with extreme maximum and minimum variations are eliminated first. There are numerous geometric properties suited to distinguishing text from non-text objects; the proposed method uses some of the geometric properties described in [33, 34] to eliminate non-text objects.
Aspect ratio: The aspect ratio is given as

$$\text{Aspect ratio} = \frac{\max(\text{width}, \text{height})}{\min(\text{width}, \text{height})} \qquad (2)$$

We limit the aspect ratio of character candidates to between 0.1 and 10. As some characters are very similar, such as ‘0’ and ‘O’, or ‘i’ and ‘l’, we merge them into one category.
Eccentricity: It is the ratio of the distance between the foci of the ellipse to its major axis length. It returns a scalar that specifies the eccentricity of the ellipse that has the same second moments as the region. An ellipse with eccentricity 0 is a circle, and one with eccentricity 1 is a line segment. Candidates with eccentricity >0.995, which are effectively line segments, are treated as non-text.
Extent: It returns a scalar that represents the ratio of pixels in the region to pixels in the total bounding box, i.e. the region area divided by the area of the bounding box. We keep candidates whose extent lies between 0.2 and 0.9.
Solidity: It returns a scalar that specifies the proportion of the pixels in the convex hull that are also in the region, i.e. the region area divided by the convex area. Candidates with solidity <0.3 are treated as non-text.
Euler number: It returns a scalar that specifies the number of objects in a region minus the number of holes in those objects, computed using 8-connectivity. Candidates with Euler number <−4 are treated as non-text.
Size: Character candidates smaller than 5 px are rejected, as they contain very little information and only add to the computation time.
Most of the obvious non-text objects are removed after applying the above geometric properties, as shown in Fig. 4a. Once these conditions are satisfied, a character candidate is passed to the next step.
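A hedged sketch of this geometric filtering stage using skimage regionprops is given below; the threshold values follow the text, while the keep/reject direction of each test is an interpretation rather than a statement of the authors' exact code.

```python
from skimage.measure import label, regionprops

def filter_candidates(binary_mask):
    """Keep connected components whose geometry is plausible for characters."""
    kept = []
    for r in regionprops(label(binary_mask)):
        h, w = r.bbox[2] - r.bbox[0], r.bbox[3] - r.bbox[1]
        aspect = max(w, h) / max(min(w, h), 1)          # eq. (2)
        if not (0.1 <= aspect <= 10):                   # aspect ratio limit
            continue
        if r.eccentricity > 0.995:                      # nearly a line segment
            continue
        if not (0.2 <= r.extent <= 0.9):                # region area / bbox area
            continue
        if r.solidity < 0.3:                            # region area / convex area
            continue
        if r.euler_number < -4:                         # too many holes
            continue
        if r.area < 5:                                  # tiny regions rejected
            continue
        kept.append(r)
    return kept
```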
2.3 Stroke width filtering
Geometric properties may not fully eliminate the non-text objects. Another common cue used to distinguish text from non-text objects is stroke width. Stroke width can be defined as the length of a straight line from a text pixel to another pixel along its gradient direction [35]. Several methods have adopted stroke width for false-positive elimination, as it computes the width of the curves and lines that can form a character [18, 36].
Fig. 2  Schematic illustration of the proposed method
Fig. 3  Textual regions extraction
(a) Original image, (b) Detected MSER regions
Fig. 4  Text detection and localisation
(a) Geometric-based non-text objects filtering, (b) SWT-based non-text objects
filtering
Text regions tend to have little stroke width variation, while non-text regions tend to have more variation. The proposed method follows the SWT [18] to further eliminate false positives. SWT is a local image operator that computes, per pixel, the width of the most likely stroke containing that pixel. The output image has the same size as the input image, and each element contains the width of the stroke associated with the corresponding pixel.
First, the initial value of each element of the SWT is set to ∞. The gradient direction dp of each edge pixel p is computed. If p lies on a stroke boundary, then dp is roughly perpendicular to the orientation of the stroke. The ray r = p + n dp, n > 0, is followed until another edge pixel q is found, and the gradient direction dq at pixel q is considered. If dq is roughly opposite to dp (dq = −dp ± π/6), then each element of the SWT output image along the segment [p, q] is assigned the width ‖p − q‖ unless it already has a lower value. The ray is discarded if q is not found or if dq is not opposite to dp.
We filter out connected components based on the ratio of the stroke width standard deviation (std) to the stroke width mean, removing components with std/mean > 0.5, a threshold obtained from the ICDAR benchmark [37]. Once the false positives are removed, the true components are fed to the next step for text line formation and word grouping. The positive text regions obtained are shown in Fig. 4b.
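A hedged sketch of the stroke-width-variation check follows. Instead of a full SWT implementation, stroke width is approximated per component from the distance transform (twice the distance to the nearest background pixel), which is a common simplification; the std/mean > 0.5 rejection threshold follows the text.

```python
import cv2

def has_text_like_stroke(component_mask, max_ratio=0.5):
    """component_mask: uint8 binary mask (255 = region pixels) of one candidate."""
    dist = cv2.distanceTransform(component_mask, cv2.DIST_L2, 5)
    # Approximate per-pixel stroke widths inside the component.
    widths = 2.0 * dist[component_mask > 0]
    widths = widths[widths > 0]
    if widths.size == 0:
        return False
    return (widths.std() / widths.mean()) <= max_ratio
```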
2.4 Text line formation
Adjacent character components are grouped together to form straight lines. To detect these lines, distinct textual components need to be merged into meaningful words. Character candidates that belong to the same text line are expected to have similar properties (i.e. stroke width, height, size, and intensity). First, the midpoints of the connected components are computed, and the Euclidean distance D and orientation angle θ between each pair of connected components are measured. As a result, two maps are obtained: a distance map and an orientation map. If D < MaxDistance, the two connected components are assumed to be adjacent characters, where MaxDistance is the maximum allowed Euclidean distance between connected components. Assuming that text is generally found in a horizontal orientation, we restrict θ to between −45° and 45°. Each component pair satisfying this rule is then checked against the similarity criteria described in [35]. The components satisfying the following criteria are processed further:
$$
\begin{aligned}
& w_i + w_j > 1.3 \times D \\
& \max(w_i/w_j,\; w_j/w_i) < 5 \\
& \max(h_i/h_j,\; h_j/h_i) < 2.5 \\
& \max(s_i/s_j,\; s_j/s_i) < 1.75 \\
& \max(n_i/n_j,\; n_j/n_i) < 1.75
\end{aligned}
$$
where $w_i$, $h_i$, $s_i$, and $n_i$ denote the width, height, mean stroke width, and intensity of the ith bounding box, respectively. The threshold values can be adjusted according to experimentation. If a line contains at least three textual objects, it is declared a text line. The process ends when no more components can be merged. Connected components satisfying the above conditions are grouped together, and the remaining components are assumed to be false positives and eliminated. The formed text lines are shown in Fig. 5a.
Furthermore, the formed text lines are split into individual words for recognition. We compute the overlap ratio between all bounding box pairs by measuring the distance between the textual component pairs. The proposed method finds non-zero overlap ratios to locate groups of neighbouring text regions. A threshold T is given as

$$T = \operatorname{mean}(D) + \alpha \times \operatorname{std}(D) \qquad (3)$$

where D is the distance vector that specifies the horizontal distance between components. If the distance between two components exceeds the threshold, they are considered to belong to different words and are separated. We set the value of α to 1.5, and bounding boxes are applied to each word individually. The applied bounding boxes are shown in Fig. 5b.
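The sketch below illustrates, under stated assumptions, the pairwise grouping test of this section and the word-split threshold of (3); w/h/s/n are the width, height, mean stroke width, and intensity of each component's bounding box, assumed to be pre-computed.

```python
import numpy as np

def are_adjacent(ci, cj, distance, max_distance, angle_deg):
    """Decide whether two components belong to the same text line."""
    if distance >= max_distance or not (-45 <= angle_deg <= 45):
        return False
    return (ci["w"] + cj["w"] > 1.3 * distance
            and max(ci["w"] / cj["w"], cj["w"] / ci["w"]) < 5
            and max(ci["h"] / cj["h"], cj["h"] / ci["h"]) < 2.5
            and max(ci["s"] / cj["s"], cj["s"] / ci["s"]) < 1.75
            and max(ci["n"] / cj["n"], cj["n"] / ci["n"]) < 1.75)

def word_split_threshold(horizontal_gaps, alpha=1.5):
    """Gaps larger than T = mean(D) + alpha * std(D) separate two words (eq. 3)."""
    d = np.asarray(horizontal_gaps, dtype=float)
    return d.mean() + alpha * d.std()
```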
2.5 Text recognition and keywords formation
The true text regions detected in Section 2.4 are fed into an OCR engine. Several commercial and open-source OCR tools are freely available; we adopted Google's open-source Tesseract OCR engine [https://opensource.google.com/projects/tesseract] for text recognition. The proposed method employs the recognised text words as tags and keywords for indexing the images. The natural approach is to retrieve similar images based on their text confidence scores and maximum string match; text words with a high confidence score are retrieved first. If no text is detected in an image, it becomes difficult to retrieve that image; hence, we favour a high recall ratio by allowing more false positives so as to obtain the maximum number of keywords. For keyword formation, the proposed method employs a neural probabilistic LM that relies on character-level inputs while its predictions are still made at the word level [38]. The model is based on a convolutional neural network whose output from a single layer is used as input at time t to a recurrent neural network LM.
Let γ be the vocabulary of recognised keywords, C the vocabulary of characters, d the dimensionality of the character embeddings, and $Q \in \mathbb{R}^{d \times |C|}$ the character embedding matrix. Suppose word $k \in \gamma$ is composed of the character sequence $(c_1, c_2, \ldots, c_l)$, where l is the length of word k. The character-level representation of word k is then given by the matrix $C^k \in \mathbb{R}^{d \times l}$, whose jth column corresponds to character $c_j$. A narrow convolution is applied between $C^k$ and a filter kernel $H \in \mathbb{R}^{d \times w}$ of width w.
Then, a bias is added and a non-linearity is applied to obtain the feature map $f^k \in \mathbb{R}^{l - w + 1}$. The ith component of the feature map $f^k$ is given as

$$f^k[i] = \tanh\bigl(\langle C^k[\ast,\, i:i+w-1], H\rangle + b\bigr) \qquad (4)$$

where $C^k[\ast,\, i:i+w-1]$ denotes the ith to (i + w − 1)th columns of $C^k$ and $\langle A, B\rangle = \operatorname{Tr}(AB^{\mathrm{T}})$ is the Frobenius inner product. The feature corresponding to filter H for word k is obtained by max-over-time pooling:

$$y^k = \max_i f^k[i] \qquad (5)$$

For a given filter, the basic approach is to capture the string with the maximum score. The network applies several filters of varying width w to obtain the feature vector for each word k. For a total of h filters $H_1, H_2, \ldots, H_h$, the input representation of word k is $\mathbf{y}^k = [y_1^k, y_2^k, \ldots, y_h^k]$.
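A hedged numpy sketch of (4) and (5) follows: a narrow character-level convolution followed by max-over-time pooling, as in the character-aware LM of [38]. The dimensions and the tanh/Frobenius-product form follow the equations above; the random initialisation is purely illustrative.

```python
import numpy as np

d, l, w = 15, 8, 3                      # char embedding dim, word length, filter width
Ck = np.random.randn(d, l)              # character matrix C^k for one word
H  = np.random.randn(d, w)              # one convolution filter of width w
b  = 0.1                                # bias

# Eq. (4): f^k[i] = tanh(<C^k[:, i:i+w-1], H> + b), Frobenius inner product.
f = np.array([np.tanh(np.sum(Ck[:, i:i + w] * H) + b) for i in range(l - w + 1)])

# Eq. (5): max-over-time pooling gives one feature y^k per filter.
y = f.max()
print(f.shape, y)    # (l - w + 1,) and a scalar feature for this filter
```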
Fig. 5  Text formation
(a) Text line formation, (b) Keyword formation
3 Similarity measure
For an accurate image retrieval system, both feature extraction and similarity measurement play an important role. Even when feature extraction goes smoothly, a poorly chosen similarity measure leads to noisy results. The proposed method supports two modes of operation: exact substring match and approximate substring match. Exact substring match retrieves images whose recognised keywords contain exactly matching words, and approximate substring match retrieves images with the closest matching strings. Generally, exact substring match has higher priority than approximate match, and its results are retrieved first. To compute the similarity distance between the detected text, we use the Euclidean distance, Canberra distance, Manhattan distance, and cosine similarity.
The feature vector of the ith database image is given as $F_{DB_i} = \{w_1, w_2, \ldots, w_N\}$, where N is the number of recognised keywords in that image. The feature vector of the query image q is given as $F_q = \{w_1, w_2, \ldots, w_N\}$, where N is the number of recognised keywords in q. The main idea is to select from the database the images with the maximum number of strings matching the query image. The distance measures are given as follows.
Euclidean distance:

$$D(F_{DB_i}, F_q) = \left(\sum_{i=1}^{N} (F_{DB_i} - F_q)^2\right)^{1/2} \qquad (6)$$

Canberra distance:

$$D(F_{DB_i}, F_q) = \sum_{i=1}^{N} \frac{|F_{DB_i} - F_q|}{|F_{DB_i}| + |F_q|} \qquad (7)$$

Manhattan distance:

$$D(F_{DB_i}, F_q) = \sum_{i=1}^{N} |F_{DB_i} - F_q| \qquad (8)$$

Cosine similarity:

$$D(F_{DB_i}, F_q) = \frac{F_{DB_i} \cdot F_q}{\lVert F_{DB_i} \rVert \, \lVert F_q \rVert} \qquad (9)$$

where $F_{DB_i}$ is the feature vector of a database image and $F_q$ is the feature vector of the query image.
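A minimal sketch of the four measures (6)–(9) is given below, assuming the keyword feature vectors have already been mapped to equal-length numeric numpy vectors (e.g. term-frequency vectors over a shared keyword vocabulary).

```python
import numpy as np

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))             # eq. (6)

def canberra(a, b):
    denom = np.abs(a) + np.abs(b)
    denom[denom == 0] = 1.0                                  # avoid division by zero
    return float(np.sum(np.abs(a - b) / denom))              # eq. (7)

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))                      # eq. (8)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # eq. (9)
```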
4 Experimental results and discussion
In this section, we present the experimental results and performance evaluation. All experiments are implemented and executed on a computer with an Intel Core i5-2100 central processing unit running at 3.10 GHz and 8 GB of random access memory.
4.1 Datasets
The experiments are conducted on two benchmark datasets to ensure the accuracy and robustness of the proposed approach. Both datasets contain textual images. The datasets are as follows:
ICDAR 2003: The dataset [37] contains 500 natural scene images with resolutions varying from 640 × 480 to 1600 × 1200, of which 251 images belong to the TrialTrain set and 249 images to the TrialTest set. The images were captured indoors and outdoors under varying conditions (i.e. text size, font, colour, illumination, and position). The text appears on signboards, banners, posters, and other objects.
Sindh: We propose a new dataset, namely Sindh, that contains a total of 1000 images, including quotation images, Twitter snapshots, natural scene images, TV news channel video frames, and other textual images. The resolution of the images varies from 320 × 240 to 1920 × 1440, and the images were collected randomly from Google, Instagram, and Twitter. We divided these images into 20 different groups.
4.2 Retrieval performance protocol
The performance of image retrieval can be computed using the mean average precision (mAP), which is the average over all image queries. The precision of the top-ranked images is given as

$$P(R_k) = \frac{|\{\text{relevant images}\} \cap \{\text{retrieved images}\}|}{|\{\text{retrieved images}\}|} \qquad (10)$$

where $R_k$ is the set of top retrieved images; we set k = 10, since users are more concerned with top-ranked results. The AP value for a single query is the average of the precision values obtained over the set of k retrieved images. The AP values are then averaged over all queries. Given the set of relevant images for a query $q_i \in Q$ as $\{I_1, \ldots, I_m\}$, where Q is the set of all queries, the mAP is given as

$$\mathrm{mAP}(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{m} \sum_{k=1}^{m} P(R_k) \qquad (11)$$
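The following is a hedged sketch of one reading of the evaluation protocol of (10)–(11): precision at the top-k results averaged per query, then averaged over all queries (k = 10 in the paper).

```python
def precision_at_k(retrieved, relevant, k=10):
    top = retrieved[:k]
    return sum(1 for img in top if img in relevant) / max(len(top), 1)    # eq. (10)

def mean_average_precision(queries, k=10):
    """queries: list of (ranked_retrieved_ids, set_of_relevant_ids) pairs."""
    aps = []
    for retrieved, relevant in queries:
        # AP for one query: average precision over the cut-offs 1..k.
        precisions = [precision_at_k(retrieved, relevant, i) for i in range(1, k + 1)]
        aps.append(sum(precisions) / len(precisions))
    return sum(aps) / max(len(aps), 1)                                     # eq. (11)
```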
4.3 Implementation detail
The proposed method detects the text embedded within images and uses it as keywords/tags for retrieving textual images. To evaluate the performance and efficiency of the proposed approach, we first perform experiments for text detection and recognition and compare with state-of-the-art methods. Next, we evaluate the proposed method for detected text-based image retrieval.
4.3.1 Text detection and recognition: In this section, we conduct two experiments: (i) text detection and (ii) end-to-end text recognition.
Experiment I: For an accurate and robust system, text detection is the most important task. For this purpose, we perform text detection evaluation on the benchmark datasets defined in Section 4.1 and compare the results with state-of-the-art methods. We follow the standard evaluation protocols of precision p and recall r stated in [37]. The precision p′, recall r′, and frequency measure f are given as

$$p' = \frac{\sum_{r_e \in E} m(r_e, T)}{|E|} \qquad (12)$$

$$r' = \frac{\sum_{r_t \in T} m(r_t, E)}{|T|} \qquad (13)$$

$$f = \frac{1}{(\alpha / p') + ((1 - \alpha)/r')} \qquad (14)$$

where E is the set of estimated words and T is the set of ground-truth targets. The frequency measure f combines precision and recall, with their relative weights controlled by α. All performance measures are computed for each image, and the average is reported as the performance of the proposed approach. For the ICDAR 2003 dataset, the proposed approach achieved 74% precision and 68% recall. For the Sindh dataset, it achieved 75% precision and 70% recall. The results demonstrate that the proposed approach outperformed state-of-the-art methods in precision and f value on the ICDAR'03 dataset. For the Sindh dataset, the accuracy is lower owing to the high complexity of its different image categories. The obtained results are given in Tables 2 and 3 for the ICDAR 2003 and Sindh datasets, respectively.
Experiment II: We evaluated the performance of end-to-end word recognition on the ICDAR 2003 and Sindh datasets. There are two metrics for recognition performance: normalised edit distance and word-level recognition. The former is an outdated metric, as it tolerates partial local errors within each word. We use the latter metric, which is quite strict and requires each character to be recognised correctly. For word recognition, we again follow the
recognition evaluation protocol defined in [37]. Precision p is the ratio of the number of words recognised correctly to the total number of words recognised by the system. Recall r is the ratio of the number of words recognised correctly to the total number of words localised and detected. If a bounding box overlaps a ground-truth bounding box by more than 50%, it is counted as a match. Tables 4 and 5 show the performance of different recognition methods evaluated on the ICDAR 2003 and Sindh datasets. The performance of the proposed method is computed by the word-level recognition rate, which is commonly used for fair comparison.
4.3.2 Image retrieval performance: To assess the retrieval accuracy of the proposed approach, similarity measures are computed and experiments are conducted on the ICDAR 2003 and Sindh datasets. From the ICDAR 2003 dataset, we randomly select 100 images and use them as query images. The Sindh dataset is divided into 20 categories; we randomly select 10 images from each category and use them as query images, giving a total of 200 queries. We compute the precision and recall ratios for each image in the database. Precision p is the ratio of the number of retrieved relevant images to the total number of retrieved images. Recall r is the ratio of the number of retrieved relevant images to the total number of relevant images in the dataset; p reflects the accuracy of the retrieval system and r its robustness. We consider the top retrieved image to be the one with the maximum number of similar words.
In this experiment, when the user provides a textual query image, the system automatically detects the text and uses it to index the images. If an image does not contain any text, the system adds an auxiliary value of ‘1’. Two operations are performed: exact substring match and approximate substring match. Exact substring match retrieves images having exactly the same words as the query image, and approximate substring match retrieves images with the closest confidence score to the query image. Exact substring match has higher priority than approximate match until k images have been retrieved. We compare image retrieval on the datasets defined in Section 4.1 against Liu's method [42], which is a purely visual image retrieval method. Table 6 shows the retrieval accuracy obtained by the proposed method. The results demonstrate that Liu's method does not perform well for textual images, whereas the proposed method performs well specifically on textual images.
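The sketch below illustrates, under stated assumptions, the two retrieval modes described above: exact keyword matches are ranked first, and approximate (closest-string) matches fill the remaining top-k slots. SequenceMatcher is used purely as an illustrative approximate string scorer, not as the authors' scoring function.

```python
from difflib import SequenceMatcher

def rank_images(query_keywords, database, k=10):
    """database: dict mapping image_id -> list of recognised keywords."""
    query = {w.lower() for w in query_keywords}
    scored = []
    for image_id, keywords in database.items():
        kws = {w.lower() for w in keywords}
        exact = len(query & kws)                       # exact substring matches
        approx = max((SequenceMatcher(None, q, w).ratio()
                      for q in query for w in kws), default=0.0)
        # Exact matches dominate the ranking; the approximate score breaks ties.
        scored.append((exact, approx, image_id))
    scored.sort(reverse=True)
    return [image_id for _, _, image_id in scored[:k]]
```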
4.4 Retrieval time complexity
For image retrieval, minimum computation and retrieval times are crucial factors. Computation time and feature selection trade off against each other: extracting additional features leads to higher time consumption. The proposed method finds a good balance between text detection and image retrieval. The time complexity of the proposed approach is given in Table 7 for both benchmarks. The results demonstrate that the proposed method outperformed Liu's method on the ICDAR 2003 dataset. However, the computation time of the proposed method on the Sindh dataset is slightly higher than that of Liu's method owing to the highly complex backgrounds and the large number of small fonts; the Sindh dataset contains several images with small fonts, which are complicated to process within the average time.
4.5 AP at different distances
For an accurate retrieval system, the choice of distance measure is also a crucial factor, as different distance measures can lead to different retrieval results. We computed the four similarity measures defined in Section 3 to assess the retrieval accuracy of the proposed method. Fig. 6 shows the accuracy at different distance measures for the top k images. The results show that the Euclidean distance performs well compared with the other distances.
5 Conclusion
In this paper, we have investigated an effective image retrieval method for textual images based on embedded and scene text. First, the proposed method detects candidate text regions using the MSER algorithm. Non-text regions are then eliminated using geometric properties and the SWT. The remaining connected components are grouped together using bounding boxes. The detected and localised text regions are fed into an OCR engine for text recognition. Keywords are formed using a neural probabilistic LM for image retrieval. Finally, the textual images are indexed and retrieved based on the detected keywords using four different distance measures. To validate the proposed method on embedded and scene text images, we have introduced a
Table 2 Performance comparison of text detection on ICDAR 2003 dataset
Method | Precision | Recall | f
Neumann and Matas [20] | 0.59 | 0.55 | 0.57
Li and Lu [35] | 0.59 | 0.59 | 0.59
Pan et al. [39] | 0.66 | 0.70 | 0.68
Chen et al. [34] | 0.73 | 0.60 | 0.66
Proposed method | 0.74 | 0.68 | 0.70

Table 3 Performance evaluation of text detection on Sindh dataset
Method | Precision | Recall | f
Proposed method | 0.75 | 0.70 | 0.72

Table 4 Performance comparison for end-to-end text recognition on ICDAR 2003
Method | IC03-50 | IC03-full
Kai et al. [40] | 0.55 | 0.56
Wang et al. [41] | 0.72 | 0.67
Proposed method | 0.73 | 0.69

Table 5 Performance evaluation for end-to-end text recognition on Sindh
Method | Sindh-50 | Sindh-full
Proposed method | 0.67 | 0.73

Table 6 Retrieval performance evaluation (mAP) on ICDAR 2003 and Sindh
Dataset/method | Liu's method [42] | Proposed method
ICDAR 2003 | 0.54 | 0.63
Sindh | 0.59 | 0.71

Table 7 Retrieval time complexity (s) on ICDAR 2003 and Sindh
Dataset/method | Liu's method [42] | Proposed method
ICDAR 2003 | 3.46 | 3.17
Sindh | 4.38 | 4.51

Fig. 6  Performance of top-ranked retrieval images on (a) ICDAR 2003, (b) Sindh dataset
new dataset containing quote images, Twitter snapshots, natural scene images, TV news channel video frames, and other textual images. The experimental results on two benchmark datasets show the effectiveness of the proposed approach for textual images. The method is robust against varying text properties such as font, size, colour, illumination, and orientation in complex backgrounds. In the future, we intend to improve the proposed approach by fusing textual features with visual features for a more accurate and efficient retrieval method.
6 Acknowledgments
This research was supported by the National Natural Science Foundation of China (Nos. 61672124 and 61370145) and the Password Theory Project of the 13th Five-Year Plan National Cryptography Development Fund (No. MMJJ20170203).
7 References
[1] Wang, X., Wang, Z.: ‘The method for image retrieval based on multi-factors
correlation utilizing block truncation coding’, Pattern Recognit., 2014, 47,
(10), pp. 3293–3303
[2] Fadaei, S., Amirfattahi, R., Ahmadzadeh, M.R.: ‘New content-based image
retrieval system based on optimised integration of DCD, wavelet and curvelet
features’, IET Image Process., 2017, 11, (2), pp. 89–98
[3] Alzu'bi, A., Amira, A., Ramzan, N.: ‘Semantic content-based image retrieval:
a comprehensive study’, J. Vis. Commun. Image Represent., 2015, 32, pp. 20–
54
[4] Jiang, F., Hu, H.-M., Zheng, J., et al.: ‘A hierarchal BoW for image retrieval
by enhancing feature salience’, Neurocomputing, 2016, 175, pp. 146–154
[5] ElAdel, A., Ejbali, R., Zaied, M., et al.: ‘A hybrid approach for content-based
image retrieval based on fast beta wavelet network and fuzzy decision support
system’, Mach. Vis. Appl., 2016, 27, (6), pp. 1–19
[6] Feng, L., Wu, J., Liu, S., et al.: ‘Global correlation descriptor: a novel image
representation for image retrieval’, J. Vis. Commun. Image Represent., 2015,
33, pp. 104–114
[7] Liu, P., Guo, J.M., Chamnongthai, K., et al.: ‘Fusion of color histogram and
LBP-based features for texture image retrieval and classification’, Inf. Sci.
(NY), 2017, 390, pp. 95–111
[8] Walia, E., Pal, A.: ‘Fusion framework for effective color image retrieval’, J.
Vis. Commun. Image Represent., 2014, 25, (6), pp. 1335–1348
[9] Wang, X., Wang, Z.: ‘A novel method for image retrieval based on structure
elements’ descriptor’, J. Vis. Commun. Image Represent., 2013, 24, (1), pp.
63–74
[10] Tang, Y., Wu, X.: ‘Scene text detection and segmentation based on cascaded
convolution neural networks’, IEEE Trans. Image Process., 2017, 26, (3), pp.
1509–1520
[11] Wei, Y., Zhang, Z., Shen, W., et al.: ‘Text detection in scene images based on
exhaustive segmentation’, Signal Process. Image Commun., 2017, 50, pp. 1–8
[12] Unar, S., Jalbani, A.H., Shaikh, M., et al.: ‘A study on text detection and
localization techniques for natural scene images’, Int. J. Comput. Sci. Netw.
Secur., 2018, 18, (1), pp. 99–111
[13] Zheng, Y., Li, Q., Liu, J., et al.: ‘A cascaded method for text detection in
natural scene images’, Neurocomputing, 2017, 238, pp. 307–315
[14] Unar, S., Jalbani, A.H., Jawaid, M.M., et al.: ‘Artificial Urdu text detection
and localization from individual video frames’, Mehran Univ. Res. J. Eng.
Technol., 2018, 37, (2), pp. 429–438
[15] Ezaki, N., Bulacu, M., Schomaker, L.: ‘Text detection from natural scene
images: towards a system for visually impaired persons’. 17th Int. Conf.
Pattern Recognition (ICPR), 2004, vol. 2, pp. 683–686
[16] Zhou, G., Liu, Y., Meng, Q., et al.: ‘Detecting multilingual text in natural
scene’. Proc. 2011 First Int. Symp. Access Spaces ISAS 2011, 2011, pp. 116–
120
[17] Zhang, J., Kasturi, R.: ‘Text detection using edge gradient and graph
spectrum’. Proc. Int. Conf. Pattern Recognition, 2010, pp. 3979–3982
[18] Epshtein, B., Ofek, E., Wexler, Y.: ‘Detecting text in natural scenes with
stroke width transform’. Proc. IEEE Computer Society Conf. Computer
Vision Pattern Recognition, 2010, pp. 2963–2970
[19] Ma, L., Wang, C., Xiao, B.: ‘Text detection in natural images based on multi-
scale edge detection and classification’. 2010 Third Int. Congress Image
Signal Processing, 2010, vol. 4, pp. 1961–1965
[20] Neumann, L., Matas, J.: ‘A method for text localization and recognition in
real-world images’. Lecture Notes in Computer Science (including Subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics),
2011 (LNCS, 6494), (PART 3), pp. 770–783
[21] Yi, C., Tian, Y.: ‘Text string detection from natural scenes by structure-based
partition and grouping’, IEEE Trans. Image Process., 2011, 20, (9), pp. 2594–
2605
[22] Yang, H.-Y., Li, Y.-W., Li, W.-Y., et al.: ‘Content-based image retrieval using
local visual attention feature’, J. Vis. Commun. Image Represent., 2014, 25,
(6), pp. 1308–1323
[23] Wang, Y., Cen, Y., Zhao, R., et al.: ‘Separable vocabulary and feature fusion
for image retrieval based on sparse representation’, Neurocomputing, 2017,
236, pp. 14–22
[24] Dimitrovski, I., Kocev, D., Loskovska, S., et al.: ‘Improving bag-of-visual-
words image retrieval with predictive clustering trees’, Inf. Sci. (NY), 2016,
329, pp. 851–865
[25] Li, S., Purushotham, S., Chen, C., et al.: ‘Measuring and predicting tag
importance for image retrieval’, IEEE Trans. Pattern Anal. Mach. Intell.,
2017, 8828, (c), pp. 1–14
[26] Wu, L., Jin, R., Jain, A.K.: ‘Tag completion for image retrieval’, IEEE Trans.
Pattern Anal. Mach. Intell., 2013, 35, (3), pp. 716–727
[27] Liu, D., Wang, M., Yang, L., et al.: ‘Tag quality improvement for social
images’. Proc. 2009 IEEE Int. Conf. Multimedia Expo ICME 2009, 2009, pp.
350–353
[28] Cui, C., Lin, P., Nie, X., et al.: ‘Hybrid textual–visual relevance learning for
content-based image retrieval’, J. Vis. Commun. Image Represent., 2017, 48,
pp. 367–374
[29] Neumann, L., Matas, J.: ‘Real-time scene text localization and recognition’.
Proc. IEEE Computer Society Conf. Computer Vision Pattern Recognition,
2012, pp. 3538–3545
[30] Shi, C., Wang, C., Xiao, B., et al.: ‘Scene text detection using graph model
built upon maximally stable extremal regions’, Pattern Recognit. Lett., 2013,
34, (2), pp. 107–116
[31] Felhi, M., Bonnier, N., Tabbone, S.: ‘A skeleton based descriptor for detecting
text in real scene images’. 2012 21st Int. Conf. Pattern Recognition (ICPR),
2012, pp. 282–285
[32] Matas, J., Chum, O., Urban, M., et al.: ‘Robust wide-baseline stereo from
maximally stable extremal regions’, Image Vis. Comput., 2004, 22, (10), pp.
761–767
[33] Bergasa, L.M., Yebes, J.J.: ‘Text location in complex images’. Int. Conf.
Pattern Recognition (ICPR), 2012, pp. 617–620
[34] Chen, H., Tsai, S.S., Schroth, G., et al.: ‘Robust text detection in natural
images with edge-enhanced maximally stable extremal regions’. 2011 18th
IEEE Int. Conf. Image Processing (ICIP), 2011, pp. 3–6
[35] Li, Y., Lu, H.: ‘Scene text detection via stroke width’. Proc. Int. Conf. Pattern
Recognition, 2012, pp. 681–684
[36] Jung, C., Liu, Q., Kim, J.: ‘A stroke filter and its application to text
localization’, Pattern Recognit. Lett., 2009, 30, (2), pp. 114–122
[37] Lucas, S.M., Panaretos, A., Sosa, L., et al.: ‘ICDAR 2003 robust reading
competitions’. Proc. Int. Conf. Document Analysis Recognition ICDAR,
2003, pp. 682–687
[38] Kim, Y., Jernite, Y., Sontag, D., et al.: ‘Character-aware neural language
models’. Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 2741–2749
[39] Pan, Y.F., Liu, C.L., Hou, X.: ‘Fast scene text localization by learning-based
filtering and verification’. Proc. Int. Conf. Image Processing ICIP, 2010, pp.
2269–2272
[40] Kai, W., Babenko, B., Belongie, S.: ‘End-to-end scene text recognition’. 2011
IEEE Int. Conf. Computer Vision (ICCV), 2011, no. 4, pp. 1457–1464
[41] Wang, T., Wu, D.J., Coates, A, et al.: ‘End-to-end text recognition with
convolutional neural networks’. 21st Int. Conf. Pattern Recognition, 2012, pp.
3304–3308
[42] Liu, G.-H., Yang, J.-Y., Li, Z.: ‘Content-based image retrieval using
computational visual attention model’, Pattern Recognit., 2015, 48, (8), pp.
2554–2566
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 

Recently uploaded (20)

HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 

Ts2 c topic

image.

Nowadays, images exist in the millions, and it is almost impossible to annotate each one manually. CBIR methods were introduced to overcome this limitation. They describe images by their visual content (i.e. colour, shape, and texture) and rely heavily on image descriptors and similarity measurement. A robust CBIR system aims to achieve high accuracy with minimum computation time. Several methods have been introduced to boost retrieval performance [4-6]; however, no ideal approach has been standardised and the problem remains challenging. With the growth of image data, simple features such as colour, shape, and texture are no longer sufficient to interpret an image efficiently.

Existing methods have mostly focused on extracting visual features such as colour, texture, and shape, and on fusing multiple visual descriptors [7-9]; indexing of similar images is then performed on these features. However, no standard method has yet been proposed for retrieving textual images. With the growing use of social media sites (e.g. Instagram, Flickr, and Facebook), millions of people share their pictures every day. Many of these pictures contain textual information, which is an additional and clearer clue for perceiving the image. It is common practice to edit pictures by writing inspirational and motivational quotes on them, as shown in Fig. 1b. Pictures captured in natural scenes also contain text against complex backgrounds, as shown in Fig. 1a. The text embedded in images is therefore useful information for automatic tagging, annotation, and indexing, and exploiting it makes it possible to retrieve textual images similar to a query image. Consequently, the detected text can be an enormous asset for improving the retrieval accuracy of TBIR and CBIR on textual images.

The automatic extraction of textual content is a challenging yet valuable task for several computer vision applications, for example helping a blind person read the contents of an image, helping a tourist translate them, or enabling a robot to act on them. In recent years, the problem of text detection and localisation in images has gained much attention [10-14], and several detection methods have been proposed. However, their core objective is only to detect and localise the text; they do not use the detected text for retrieving similar images. Some well-known text detection methods are summarised in Table 1.

Most state-of-the-art CBIR methods explore visual features, sometimes fusing several of them to achieve high accuracy, and perform image indexing on these features. In [7], Liu et al. proposed the colour information feature (CIF) and combined it with a local binary pattern (LBP)-based feature, since LBP-based features are not always good at capturing rich colour information. CIF can describe colour information, image brightness, and colour distribution.
Walia and Pal [8] proposed a fusion framework that combines low-level features using the colour difference histogram and the angular radial transform. Yang et al. [22] presented an approach based on a salient point detector and salient point expansion with local visual features, where the salient points are obtained with the speeded-up robust features detector. To cope with large visual vocabularies, Wang et al. [23] proposed a hierarchy of medium-sized vocabularies, with sparse representation used to select a specific vocabulary. In [24], Dimitrovski et al. employed predictive clustering trees to build an indexing structure for codebook construction; such codebooks can efficiently increase the discriminative power of the dictionary. These methods were developed for visual images (i.e. images containing colourful objects) and do not perform well on textual images.

Several methods have also been proposed to retrieve images based on social tags and keywords. In [25], Li et al. proposed a model that extracts visual, contextual, and semantic features to identify objects and predict scene tag importance; however, such tags are sometimes inaccurate and reflect emotion only. Wu et al. [26] introduced a method for incomplete and missing social tags, in which a tag matrix models the image-tag relation by combining observed tags and visual similarity. Liu et al. [27] proposed an approach for improving improper social tags that models the consistency of visual and semantic similarities with the social tags before and after improvement. Other authors introduced hybrid visual-textual relevance learning: Cui et al. [28] proposed a method that extracts text from image tags and associates it with visual features. In summary, these methods use visual features and social tags for indexing and retrieval, and no standard method has yet been proposed that detects the text in an image automatically and retrieves images based on that detected text.

In this paper, we propose a novel approach that retrieves similar textual images by detecting embedded and scene text. In particular, we apply a text detection technique and employ the detected text as keywords and tags for indexing and retrieving textual images. The key contributions of this work are as follows:

• To the best of our knowledge, there has been no standard method for textual image retrieval. This work is one of the few investigations on indexing and retrieving textual images effectively.
• The proposed method is designed for textual images, i.e. images containing text (e.g. quote images, scene images, and individual video frames).
• A fully automatic TBIR method is proposed that retrieves similar textual images based on the text detected in complex-background images; the detected text is employed as keywords/tags for indexing and retrieval.
• The method is robust and efficient in retrieving similar textual images for a given textual query image, and is based on an easy-to-use framework.
• A new dataset of 1000 images in 20 categories is introduced. It includes quote images, natural scene images, Twitter snapshots, TV channel video frames, and other textual images.

The rest of this paper is organised as follows. Section 2 describes the proposed method.
Section 3 presents the different similarity distance measures. Section 4 introduces a new dataset and evaluates the experimental results. Section 5 concludes the paper and outlines future directions.

Fig. 1: Sample textual images from the datasets. (a) ICDAR 2003 dataset, (b) Sindh dataset.

Table 1: State-of-the-art methods for text detection and localisation in natural scene images
  Method                   Precision  Recall  F value  Features                   Determination                            Datasets
  Ezaki et al. [15]        60         64      62       connected-component based  text detection in natural scene images   ICDAR 2003
  Zhou et al. [16]         37         88      53       texture based              text localisation and classification     ICDAR 2003
  Zhang and Kasturi [17]   67         46      -        edge based                 text edge detection and extraction       ICDAR 2003
  Epshtein et al. [18]     73         60      66       stroke based               stroke width transform (SWT)             ICDAR 2003, ICDAR 2005
  Ma et al. [19]           67         72      -        edge and CC based          component analysis and edge detection    ICDAR 2003
  Neumann and Matas [20]   59         55      57       texture and edge based     text localisation using MSER             ICDAR 2003, Chars75K
  Yi and Tian [21]         71         62      62       connected-component based  text detection in natural scene images   ICDAR 2003, OSTD
  ICDAR: International Conference on Document Analysis and Recognition; OSTD: Oriented Scene Text Dataset.

2 Proposed method

In this section, we introduce a novel approach that detects text and employs it to index and retrieve similar textual images. First, candidate text regions are detected using the maximally stable extremal region (MSER) algorithm. Since several non-text regions may remain after MSER, they are removed using simple geometric properties, and further filtering is carried out with the stroke width transform (SWT). After the positive text regions are obtained, bounding boxes are applied to form text lines from the textual components. Once the text is localised and detected, it is fed into optical character recognition (OCR), and a neural probabilistic language model (LM) is employed to form individual keywords from the recognised text. Four distance and similarity measures, namely the Euclidean distance, Canberra distance, Manhattan distance, and cosine similarity, are used to compute the similarity between the query image and the database images, and the top-ranked images are retrieved based on this computation. A schematic illustration of the proposed approach is shown in Fig. 2.
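To make the overall flow concrete, the following minimal sketch (not the authors' implementation) strings together off-the-shelf OpenCV MSER detection and Tesseract OCR via pytesseract. It deliberately skips the filtering and grouping stages of Sections 2.2-2.4, which are sketched separately below; the "--psm 7" page-segmentation setting and the lower-casing of keywords are assumptions.

```python
import cv2
import pytesseract

def extract_keywords(image_path):
    """End-to-end sketch: detect candidate text regions with MSER, OCR each
    candidate bounding box, and return the recognised words as keywords."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    mser = cv2.MSER_create()
    _, bboxes = mser.detectRegions(gray)      # candidate regions (Section 2.1)

    keywords = set()
    for (x, y, w, h) in bboxes:
        roi = gray[y:y + h, x:x + w]
        text = pytesseract.image_to_string(roi, config="--psm 7").strip()
        for word in text.split():
            if word.isalnum():                # keep plausible keywords only
                keywords.add(word.lower())
    return sorted(keywords)                   # keywords/tags used for indexing
```

In practice the raw MSER boxes would first pass through the non-text filtering and word grouping described next; the sketch only illustrates how the stages chain together.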
2.1 Character candidate extraction

MSER is regarded as one of the best region detectors owing to its robustness against scale, viewpoint, and illumination changes, and several methods have adopted it to extract character candidates with satisfactory results [29-31]. Its main advantage over traditional methods is that it detects most textual components even in low-quality images. Since text generally has distinct contrast against its complex background and a comparatively uniform colour intensity, MSER is a natural choice, and the proposed method employs it to extract character candidate regions [32]. Let p1, p2, p3, ..., pi be a sequence of nested extremal regions, i.e. pi ⊂ pi+1. Region pi is an MSER if

$v(i) = \frac{\lvert p_{i+\Delta} - p_i \rvert}{\lvert p_i \rvert}$  (1)

has a local minimum at i, where Δ is a parameter. The text regions obtained after applying the MSER filter are shown in Fig. 3b.

2.2 Non-text objects filtering

Many non-text objects may remain after MSER, so simple geometric properties such as width, height, and aspect ratio are used to filter out obvious non-text objects; objects with maximum and minimum variations are eliminated first. Numerous geometric properties can distinguish text from non-text objects, and the proposed method uses those described in [33, 34]:

Aspect ratio: the aspect ratio is given as

$\mathrm{Aspect\_ratio} = \frac{\max(\mathrm{width}, \mathrm{height})}{\min(\mathrm{width}, \mathrm{height})}$  (2)

We limit the aspect ratio of character candidates to the range 0.1-10. Some characters are very similar (e.g. '0' and 'O', 'i' and 'l'), so we merge them into one category.

Eccentricity: the ratio of the distance between the foci of the fitted ellipse to its major axis length. It is a scalar specifying the eccentricity of the ellipse with the same second moments as the region; an ellipse with eccentricity 0 is a circle and one with eccentricity 1 is a line segment. Regions with eccentricity > 0.995 are treated as line segments and removed.

Extent: a scalar giving the ratio of pixels in the region to pixels in the bounding box, i.e. the region area divided by the bounding-box area. Regions with extent outside the range 0.2-0.9 are removed.

Solidity: a scalar giving the proportion of the convex-hull pixels that are also in the region, i.e. the region area divided by the convex area. Regions with solidity < 0.3 are removed.

Euler number: a scalar giving the number of objects in the region minus the number of holes in those objects, computed with 8-connectivity. Regions with Euler number < -4 are removed.

Size: character candidates smaller than 5 px are discarded, as they carry very little information and only add processing time.

Most obvious non-text objects are removed after applying these geometric properties, as shown in Fig. 4a. Candidates that satisfy all the conditions are passed to the next step; a combined sketch of the MSER extraction and geometric filtering is given below.
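As an illustration (a minimal sketch, not the authors' implementation), the snippet below detects MSER candidates with OpenCV, rasterises them into a binary mask, and applies the geometric tests above through scikit-image region properties. The listed thresholds are interpreted here as rejection criteria and may need tuning.

```python
import cv2
import numpy as np
from skimage.measure import label, regionprops

def mser_candidate_mask(gray):
    """Section 2.1: rasterise MSER point sets into a binary candidate mask."""
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    mask = np.zeros(gray.shape, dtype=bool)
    for pts in regions:                      # pts is an array of (x, y) points
        mask[pts[:, 1], pts[:, 0]] = True
    return mask

def filter_non_text_regions(mask):
    """Section 2.2: keep connected components whose geometry is plausible for
    characters, using the thresholds listed above."""
    kept = []
    for region in regionprops(label(mask)):
        min_r, min_c, max_r, max_c = region.bbox
        h, w = max_r - min_r, max_c - min_c
        aspect = max(w, h) / max(min(w, h), 1)
        if not 0.1 <= aspect <= 10:          # aspect-ratio limits
            continue
        if region.eccentricity > 0.995:      # essentially a line segment
            continue
        if not 0.2 <= region.extent <= 0.9:  # region area vs. bounding-box area
            continue
        if region.solidity < 0.3:            # region area vs. convex-hull area
            continue
        if region.euler_number < -4:         # too many holes
            continue
        if region.area < 5:                  # smaller than 5 px
            continue
        kept.append(region)
    return kept
```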
Fig. 2: Schematic illustration of the proposed method.
Fig. 3: Textual region extraction. (a) Original image, (b) Detected MSER regions.
Fig. 4: Text detection and localisation. (a) Non-text object filtering based on geometric properties, (b) Non-text object filtering based on SWT.
2.3 Stroke width filtering

Geometric properties alone may not eliminate all non-text objects. Another common cue for distinguishing text from non-text is the stroke width, defined as the length of a straight line from a text pixel to another pixel along its gradient direction [35]. Several methods have adopted stroke width for false-positive elimination, since it measures the width of the curves and lines that form a character [18, 36]. Text regions tend to have little stroke width variation, whereas non-text regions vary more. The proposed method follows the SWT of [18] to further eliminate false positives.

SWT is a local image operator that measures, for each pixel, the width of the most likely stroke containing that pixel. The output image has the same size as the input image, and each element holds the stroke width associated with the corresponding pixel. Initially, every element of the SWT output is set to ∞. The gradient direction dp of each edge pixel p is computed; if p lies on a stroke boundary, dp is roughly perpendicular to the stroke orientation. The ray r = p + n·dp, n > 0, is followed until another edge pixel q is found, and the gradient direction dq at q is examined. If dq is roughly opposite to dp (dq = -dp ± π/6), each element of the SWT output along the segment [p, q] is assigned the width ∥p - q∥ unless it already holds a lower value; otherwise, or if no q is found, the ray is discarded.

Connected components are then filtered by the ratio of their stroke width standard deviation (std) to their stroke width mean: components with std/mean > 0.5 are rejected, a threshold obtained from the ICDAR benchmark [37]. Once the false positives are removed, the true components are passed to the next step for text line formation and word grouping. The positive text regions obtained are shown in Fig. 4b.
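The rejection rule above can be sketched without the full SWT machinery. In the hedged approximation below, per-component stroke widths are estimated as twice the distance-to-background sampled along the component skeleton; this is a cheap stand-in for the operator of [18], not the paper's implementation, and the 0.5 threshold is the value quoted above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

def stroke_widths(component_mask):
    """Approximate stroke widths for one connected component: twice the
    distance to the background, sampled on the component's skeleton."""
    dist = distance_transform_edt(component_mask)
    return 2.0 * dist[skeletonize(component_mask)]

def is_text_component(component_mask, max_ratio=0.5):
    """Section 2.3 rejection rule: keep a component only if the ratio of
    stroke-width standard deviation to mean does not exceed 0.5."""
    widths = stroke_widths(component_mask)
    if widths.size == 0:
        return False
    return widths.std() / widths.mean() <= max_ratio
```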
2.4 Text line formation

Adjacent character components are grouped together to form text lines; to detect these lines, the individual textual components need to be merged into meaningful words. Character candidates belonging to the same text line are expected to have similar properties (i.e. stroke width, height, size, and intensity). First, the midpoints of the connected components are compared by computing the Euclidean distance D and the orientation angle θ between each pair of connected components, yielding two maps: a distance map and an orientation map. If D < MaxDistance, the two connected components are assumed to be adjacent characters, where MaxDistance is the maximum Euclidean distance allowed between components. Assuming that text is generally found in a horizontal orientation, θ is restricted to between -45° and 45°. Each component pair satisfying these rules is then checked against the similarity criteria described in [35]; components satisfying all of the following are processed further:

$w_i + w_j > 1.3 \times D$
$\max(w_i/w_j, w_j/w_i) < 5$
$\max(h_i/h_j, h_j/h_i) < 2.5$
$\max(s_i/s_j, s_j/s_i) < 1.75$
$\max(n_i/n_j, n_j/n_i) < 1.75$

where wi, hi, si, and ni denote the width, height, mean stroke width, and intensity of the ith bounding box, respectively. The threshold values can be adjusted experimentally. If a line contains at least three textual objects, it is declared a text line; the process ends when no more components can be merged. Connected components satisfying the above conditions are grouped together, while the remaining components are considered false positives and eliminated. The formed text lines are shown in Fig. 5a.

The formed text lines are then split into individual words for recognition. We compute the overlap ratio between all bounding-box pairs by measuring the distance between the textual components, and look for non-zero overlap ratios to locate groups of neighbouring text regions. A threshold T is given as

$T = \mathrm{mean}(D) + \alpha \times \mathrm{std}(D)$  (3)

where D is the vector of horizontal distances between components. If the distance between two components exceeds the threshold, they are considered to belong to different words and are separated. We set α = 1.5 and apply a bounding box to each word individually; the resulting bounding boxes are shown in Fig. 5b.

2.5 Text recognition and keywords formation

The true text regions detected in Section 2.4 are fed into an OCR engine. Several commercial and open-source OCR tools are freely available; we adopt Google's open-source Tesseract OCR engine [https://opensource.google.com/projects/tesseract] for text recognition. The proposed method employs the recognised words as tags and keywords for indexing the images. The natural strategy is to retrieve similar images based on the text confidence score and the maximum string match, so words with a high confidence score are retrieved first. If no text is detected in an image, retrieving that image becomes difficult; hence, we aim for a high recall ratio by allowing more false positives in order to obtain as many keywords as possible.

For keyword formation, the proposed method employs a neural probabilistic LM that relies on character-level inputs while still making predictions at the word level [38]. The model is based on a convolutional neural network whose single-layer output at time t is fed into a recurrent neural network LM. Let γ be the vocabulary of recognised keywords, C the vocabulary of characters, d the dimensionality of each character embedding, and Q ∈ ℝ^{d×|C|} the character embedding matrix. Suppose word k ∈ γ consists of the characters (c1, c2, ..., cl), where l is the length of word k. The character-level representation of k is then the matrix C^k ∈ ℝ^{d×l}, whose jth column corresponds to character cj. A narrow convolution between C^k and a filter (kernel) H ∈ ℝ^{d×w} of width w is applied, a bias is added, and a non-linearity is applied to obtain the feature map f^k ∈ ℝ^{l-w+1}. The ith component of f^k is

$f^k[i] = \tanh\left(\langle C^k[\ast, i:i+w-1], H \rangle + b\right)$  (4)

where C^k[∗, i:i+w-1] denotes the ith to (i+w-1)th columns of C^k and ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product. The feature corresponding to filter H for word k is taken as the max over time

$y^k = \max_i f^k[i]$  (5)

For a given filter, the basic idea is to pick the substring with the maximum score. The network uses several filters of varying width w to obtain a feature vector for each word k: for a total of h filters H1, H2, ..., Hh, the representation of k is y^k = (y^k_1, y^k_2, ..., y^k_h).

Fig. 5: Text formation. (a) Text line formation, (b) Keyword formation.
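The OCR and keyword-collection step can be sketched with pytesseract, which exposes Tesseract's per-word confidence scores used later for ranking. This is a hedged sketch rather than the authors' code: the min_confidence cut-off is an assumed tuning parameter, and word_boxes is assumed to be the list of word bounding boxes produced in Section 2.4.

```python
import pytesseract
from pytesseract import Output

def recognise_keywords(gray, word_boxes, min_confidence=40):
    """Section 2.5: OCR each detected word box and keep the recognised string
    together with Tesseract's confidence, so images can later be ranked by
    string match and confidence score."""
    keywords = []
    for (x, y, w, h) in word_boxes:
        roi = gray[y:y + h, x:x + w]
        data = pytesseract.image_to_data(roi, config="--psm 7",
                                         output_type=Output.DICT)
        for text, conf in zip(data["text"], data["conf"]):
            text = text.strip().lower()
            if text and float(conf) >= min_confidence:
                keywords.append((text, float(conf)))
    return keywords
```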
3 Similarity measure

For an accurate image retrieval system, both feature extraction and similarity measurement play an important role: even when feature extraction goes smoothly, a poorly chosen similarity measure yields noisy results. The proposed method supports two modes of operation: exact substring match and approximate substring match. Exact substring match retrieves images whose recognised keywords match the query words exactly, whereas approximate substring match retrieves images with the closest matching strings. In general, exact substring matches have priority over approximate matches and are retrieved first.

To compute the similarity between the detected texts, we use the Euclidean distance, Canberra distance, Manhattan distance, and cosine similarity. The feature vector of the ith database image is F_{DBi} = {w1, w2, ..., wN}, where N is the number of recognised keywords in that image, and the feature vector of the query image q is F_q = {w1, w2, ..., wN}, where N is the number of recognised keywords in q. The aim is to select from the database the images having the maximum number of strings matching the query image. The distance measures are as follows.

Euclidean distance:

$D(F_{DBi}, F_q) = \left( \sum_{i=1}^{N} (F_{DBi} - F_q)^2 \right)^{1/2}$  (6)

Canberra distance:

$D(F_{DBi}, F_q) = \sum_{i=1}^{N} \frac{\lvert F_{DBi} - F_q \rvert}{\lvert F_{DBi} \rvert + \lvert F_q \rvert}$  (7)

Manhattan distance:

$D(F_{DBi}, F_q) = \sum_{i=1}^{N} \lvert F_{DBi} - F_q \rvert$  (8)

Cosine similarity:

$D(F_{DBi}, F_q) = \frac{F_{DBi} \cdot F_q}{\lVert F_{DBi} \rVert \, \lVert F_q \rVert}$  (9)

where F_{DBi} is the feature vector of a database image and F_q is the feature vector of the query image.
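The four measures of eqs. (6)-(9) can be computed directly once each image's keywords are mapped onto a fixed vocabulary. In the sketch below, the vocabulary (a word-to-index dictionary) and the bag-of-keywords counting are assumptions about how the feature vectors F are built, not details given in the paper.

```python
import numpy as np

def keyword_vector(keywords, vocabulary):
    """Bag-of-keywords vector F over a fixed vocabulary (word -> index)."""
    vec = np.zeros(len(vocabulary))
    for word in keywords:
        if word in vocabulary:
            vec[vocabulary[word]] += 1
    return vec

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))                        # eq. (6)

def canberra(a, b):
    denom = np.abs(a) + np.abs(b)
    mask = denom > 0                                            # skip empty terms
    return np.sum(np.abs(a - b)[mask] / denom[mask])            # eq. (7)

def manhattan(a, b):
    return np.sum(np.abs(a - b))                                # eq. (8)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)  # eq. (9)
```

Note that eqs. (6)-(8) are distances (smaller means more similar), whereas eq. (9) is a similarity (larger means more similar), so rankings must be ordered accordingly.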
4 Experimental results and discussion

In this section, we present the experimental results and performance evaluation. All experiments are implemented and executed on a computer with 8 GB of random access memory and a 3.10 GHz Intel Core i5-2100 central processing unit.

4.1 Datasets

The experiments are conducted on two benchmark datasets, both containing textual images, to assess the accuracy and robustness of the proposed approach:

ICDAR 2003: this dataset [37] contains 500 natural scene images with resolutions varying from 640 × 480 to 1600 × 1200, of which 251 images belong to the TrialTrain set and 249 to the TrialTest set. The images were captured indoors and outdoors under varying conditions (i.e. text size, font, colour, illumination, and position), with text appearing on signboards, banners, posters, and other objects.

Sindh: we propose a new dataset, named Sindh, containing 1000 images in total, including quotation images, Twitter snapshots, natural scene images, TV news channel video frames, and other textual images. The resolution of the images varies from 320 × 240 to 1920 × 1440; they were collected randomly from Google, Instagram, and Twitter and divided into 20 groups.

4.2 Retrieval performance protocol

The retrieval accuracy is measured by the mean average precision (mAP), i.e. the average over all image queries. The average precision (AP) of the top-ranked images is given as

$P(R_k) = \frac{\lvert \text{relevant images} \cap \text{retrieved images} \rvert}{\lvert \text{retrieved images} \rvert}$  (10)

where Rk denotes the top retrieved images; we set k = 10 since users are mostly concerned with the top-ranked results. The AP value for a single query is the average of the precision values obtained for the set of k images, and the AP values are then averaged over all queries. Given the set of relevant images for a query qi ∈ Q as {I1, ..., Im}, where Q is the set of all queries, the mAP is

$\mathrm{mAP}(Q) = \frac{1}{\lvert Q \rvert} \sum_{i=1}^{\lvert Q \rvert} \frac{1}{m} \sum_{k=1}^{m} P(R_k)$  (11)

4.3 Implementation detail

The proposed method detects the text embedded within images and uses it as keywords/tags for retrieving textual images. To evaluate its performance and efficiency, we first run experiments on text detection and recognition and compare with state-of-the-art methods; we then evaluate the proposed method for detected text-based image retrieval.

4.3.1 Text detection and recognition: here we conduct two experiments: (i) text detection and (ii) end-to-end text recognition.

Experiment I: accurate text detection is the most important task for a robust system. We therefore evaluate text detection on the benchmark datasets defined in Section 4.1 and compare the results with state-of-the-art methods, following the standard evaluation protocols for precision and recall stated in [37]. The precision p′, recall r′, and frequency measure f are given as

$p' = \frac{\sum_{r_e \in E} m(r_e, T)}{\lvert E \rvert}$  (12)

$r' = \frac{\sum_{r_t \in T} m(r_t, E)}{\lvert T \rvert}$  (13)

$f = \frac{1}{(\alpha / p') + ((1 - \alpha)/r')}$  (14)

where E is the set of estimated words and T is the set of ground-truth targets. The frequency measure f combines precision and recall, with their relative weights controlled by α. All performance measures are computed per image and then averaged to give the overall performance of the proposed approach. On the ICDAR 2003 dataset, the proposed approach achieves 74% precision and 68% recall; on the Sindh dataset, it achieves 75% precision and 70% recall. The results show that the proposed approach outperforms state-of-the-art methods in precision and f value on the ICDAR 2003 dataset, while the accuracy on the Sindh dataset is lower due to the high complexity of its different image categories. The obtained results are given in Tables 2 and 3 for the ICDAR 2003 and Sindh datasets, respectively; a sketch of this matching-based evaluation is given below.
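The sketch below illustrates eqs. (12)-(14) for one image, assuming, as in the ICDAR 2003 protocol [37], that the match score m(r, T) is the best area-overlap ratio between a rectangle and the target set; the exact definition of m in [37] differs slightly, so this is an approximation for illustration only.

```python
def box_match(box, targets):
    """Best-match score m(r, T): largest area-overlap ratio between a
    rectangle (x0, y0, x1, y1) and any rectangle in the target set."""
    def overlap(a, b):
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        iw = max(0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
        return inter / union if union else 0.0
    return max((overlap(box, t) for t in targets), default=0.0)

def detection_scores(estimates, targets, alpha=0.5):
    """Precision, recall and frequency measure of eqs. (12)-(14) for one image."""
    p = sum(box_match(e, targets) for e in estimates) / max(len(estimates), 1)
    r = sum(box_match(t, estimates) for t in targets) / max(len(targets), 1)
    f = 1.0 / (alpha / p + (1 - alpha) / r) if p and r else 0.0
    return p, r, f
```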
Experiment II: we evaluate end-to-end word recognition on the ICDAR 2003 and Sindh datasets. Two metrics are common for recognition performance: normalised edit distance and word-level recognition. The former is an older metric that tolerates partial local errors within a word; we use the latter, stricter metric, which requires every character to be recognised correctly. For word recognition we again follow the evaluation protocols defined in [37]. Precision p is the ratio of the number of correctly recognised words to the total number of words recognised by the system, and recall r is the ratio of the number of correctly recognised words to the total number of words localised and detected. A bounding box is counted as a match if it overlaps a ground-truth bounding box by more than 50%. Tables 4 and 5 show the performance of different recognition methods on the ICDAR 2003 and Sindh datasets; the performance of the proposed method is reported as the word-level recognition rate, which is commonly used for fair comparison.

4.3.2 Image retrieval performance: to assess the retrieval accuracy of the proposed approach, similarity measures are computed in experiments on the ICDAR 2003 and Sindh datasets. From ICDAR 2003, we randomly select 100 images and use them as queries. The Sindh dataset is divided into 20 categories; we randomly select 10 images from each category as queries, giving 200 query images in total. We compute the precision and recall ratios for each image in the database: precision p is the ratio of the number of retrieved relevant images to the total number of retrieved images, and recall r is the ratio of the number of retrieved relevant images to the total number of relevant images in the dataset, so that p reflects the accuracy and r the robustness of the retrieval system. The top retrieved image is taken to be the one with the maximum number of matching words.

In this experiment, when the user provides a textual query image, the system automatically detects the text and uses it to index the images; if an image contains no text, the system assigns it an auxiliary value of '1'. Two operations are performed: exact substring match and approximate substring match. Exact substring match retrieves images containing exactly the same words as the query image, while approximate substring match retrieves images whose confidence scores are closest to the query image. Exact substring matches have priority over approximate matches until k images have been retrieved (a sketch of these two matching modes is given at the end of this section). We compare image retrieval on the datasets defined in Section 4.1 against Liu's method [42], a purely visual image retrieval method. Table 6 shows the retrieval accuracy obtained; the results demonstrate that Liu's method does not perform well on textual images, whereas the proposed method performs well specifically on them.

4.4 Retrieval time complexity

For image retrieval, low computation and retrieval times are crucial. Computation time and feature selection trade off against each other: extracting additional features leads to more time consumption. The proposed method finds a good balance between text detection and image retrieval. The time complexity of the proposed approach on both benchmarks is given in Table 7. The results show that the proposed method outperforms Liu's method on the ICDAR 2003 dataset. On the Sindh dataset, its computation time is slightly higher than that of Liu's method because of the highly complex backgrounds and the large number of small fonts; the Sindh dataset contains several images with small fonts, which are difficult to process within the average time.
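The two retrieval modes described in Section 4.3.2 can be sketched as a simple ranking routine. The sketch below counts exact keyword matches first and breaks ties with an approximate string similarity; SequenceMatcher is used here as an assumed stand-in for the paper's confidence-based approximate matching, and the database layout (image id mapped to its recognised keywords) is likewise an assumption.

```python
from difflib import SequenceMatcher

def rank_images(query_keywords, database, k=10):
    """Rank database entries (image_id -> list of recognised keywords) against
    a query: exact matches are counted first (exact substring mode), and ties
    are broken by an approximate string similarity (approximate mode)."""
    query = [w.lower() for w in query_keywords]
    scored = []
    for image_id, keywords in database.items():
        keywords = [w.lower() for w in keywords]
        exact = sum(1 for w in query if w in keywords)
        approx = max((SequenceMatcher(None, q, w).ratio()
                      for q in query for w in keywords), default=0.0)
        scored.append((exact, approx, image_id))
    scored.sort(reverse=True)
    return [image_id for _, _, image_id in scored[:k]]
```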
4.5 AP at different distances

For an accurate retrieval system, the choice of distance measure is also a crucial factor, since different distance measures can affect the retrieval results differently. We computed the four similarity measures defined in Section 3 to assess the retrieval accuracy of the proposed method. Fig. 6 shows the accuracy at different distance measures for the top k images; the results show that the Euclidean distance performs best among the compared measures.

Table 2: Performance comparison of text detection on the ICDAR 2003 dataset
  Method                   Precision  Recall  f
  Neumann and Matas [20]   0.59       0.55    0.57
  Li and Lu [35]           0.59       0.59    0.59
  Pan et al. [39]          0.66       0.70    0.68
  Chen et al. [34]         0.73       0.60    0.66
  Proposed method          0.74       0.68    0.70

Table 3: Performance evaluation of text detection on the Sindh dataset
  Method                   Precision  Recall  f
  Proposed method          0.75       0.70    0.72

Table 4: Performance comparison for end-to-end text recognition on ICDAR 2003
  Method                   IC03-50  IC03-full
  Kai et al. [40]          0.55     0.56
  Wang et al. [41]         0.72     0.67
  Proposed method          0.73     0.69

Table 5: Performance evaluation for end-to-end text recognition on Sindh
  Method                   Sindh-50  Sindh-full
  Proposed method          0.67      0.73

Table 6: Retrieval performance (mAP) on ICDAR 2003 and Sindh
  Dataset      Liu's method [42]  Proposed method
  ICDAR 2003   0.54               0.63
  Sindh        0.59               0.71

Table 7: Retrieval time (s) on ICDAR 2003 and Sindh
  Dataset      Liu's method [42]  Proposed method
  ICDAR 2003   3.46               3.17
  Sindh        4.38               4.51

Fig. 6: Performance of top-ranked retrieved images on (a) ICDAR 2003, (b) the Sindh dataset.

5 Conclusion

In this paper, we have investigated an effective image retrieval method for textual images based on embedded and scene text. The proposed method first detects candidate text regions using the MSER algorithm; non-text regions are eliminated using geometric properties and the SWT, and the remaining connected components are grouped using bounding boxes. The detected and localised text regions are fed into an OCR engine for recognition, and keywords are formed with a neural probabilistic LM for retrieval. Finally, the textual images are indexed and retrieved based on the detected keywords using four different distance measures. To validate the proposed method on embedded and scene text images, we have offered a new dataset containing quote images, Twitter snapshots, natural scene images, TV news channel video frames, and other textual images. The experimental results on two benchmark datasets show the effectiveness of the proposed approach for textual images; the method is robust against varying text properties such as font, size, colour, illumination, and orientation in complex backgrounds. In future work, we intend to improve the proposed approach by fusing textual features with visual features for a more accurate and efficient retrieval method.
6 Acknowledgments

This research was supported by the National Natural Science Foundation of China (Nos. 61672124 and 61370145) and the Password Theory Project of the 13th Five-Year Plan National Cryptography Development Fund (No. MMJJ20170203).

7 References

[1] Wang, X., Wang, Z.: 'The method for image retrieval based on multi-factors correlation utilizing block truncation coding', Pattern Recognit., 2014, 47, (10), pp. 3293–3303
[2] Fadaei, S., Amirfattahi, R., Ahmadzadeh, M.R.: 'New content-based image retrieval system based on optimised integration of DCD, wavelet and curvelet features', IET Image Process., 2017, 11, (2), pp. 89–98
[3] Alzu'bi, A., Amira, A., Ramzan, N.: 'Semantic content-based image retrieval: a comprehensive study', J. Vis. Commun. Image Represent., 2015, 32, pp. 20–54
[4] Jiang, F., Hu, H.-M., Zheng, J., et al.: 'A hierarchal BoW for image retrieval by enhancing feature salience', Neurocomputing, 2016, 175, pp. 146–154
[5] ElAdel, A., Ejbali, R., Zaied, M., et al.: 'A hybrid approach for content-based image retrieval based on fast beta wavelet network and fuzzy decision support system', Mach. Vis. Appl., 2016, 27, (6), pp. 1–19
[6] Feng, L., Wu, J., Liu, S., et al.: 'Global correlation descriptor: a novel image representation for image retrieval', J. Vis. Commun. Image Represent., 2015, 33, pp. 104–114
[7] Liu, P., Guo, J.M., Chamnongthai, K., et al.: 'Fusion of color histogram and LBP-based features for texture image retrieval and classification', Inf. Sci. (NY), 2017, 390, pp. 95–111
[8] Walia, E., Pal, A.: 'Fusion framework for effective color image retrieval', J. Vis. Commun. Image Represent., 2014, 25, (6), pp. 1335–1348
[9] Wang, X., Wang, Z.: 'A novel method for image retrieval based on structure elements' descriptor', J. Vis. Commun. Image Represent., 2013, 24, (1), pp. 63–74
[10] Tang, Y., Wu, X.: 'Scene text detection and segmentation based on cascaded convolution neural networks', IEEE Trans. Image Process., 2017, 26, (3), pp. 1509–1520
[11] Wei, Y., Zhang, Z., Shen, W., et al.: 'Text detection in scene images based on exhaustive segmentation', Signal Process. Image Commun., 2017, 50, pp. 1–8
[12] Unar, S., Jalbani, A.H., Shaikh, M., et al.: 'A study on text detection and localization techniques for natural scene images', Int. J. Comput. Sci. Netw. Secur., 2018, 18, (1), pp. 99–111
[13] Zheng, Y., Li, Q., Liu, J., et al.: 'A cascaded method for text detection in natural scene images', Neurocomputing, 2017, 238, pp. 307–315
[14] Unar, S., Jalbani, A.H., Jawaid, M.M., et al.: 'Artificial Urdu text detection and localization from individual video frames', Mehran Univ. Res. J. Eng. Technol., 2018, 37, (2), pp. 429–438
[15] Ezaki, N., Bulacu, M., Schomaker, L.: 'Text detection from natural scene images: towards a system for visually impaired persons'. 17th Int. Conf. Pattern Recognition (ICPR), 2004, vol. 2, pp. 683–686
[16] Zhou, G., Liu, Y., Meng, Q., et al.: 'Detecting multilingual text in natural scene'. Proc. First Int. Symp. Access Spaces (ISAS 2011), 2011, pp. 116–120
[17] Zhang, J., Kasturi, R.: 'Text detection using edge gradient and graph spectrum'. Proc. Int. Conf. Pattern Recognition, 2010, pp. 3979–3982
[18] Epshtein, B., Ofek, E., Wexler, Y.: 'Detecting text in natural scenes with stroke width transform'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010, pp. 2963–2970
[19] Ma, L., Wang, C., Xiao, B.: 'Text detection in natural images based on multi-scale edge detection and classification'. Third Int. Congress on Image and Signal Processing, 2010, vol. 4, pp. 1961–1965
[20] Neumann, L., Matas, J.: 'A method for text localization and recognition in real-world images'. Lecture Notes in Computer Science (LNCS 6494), 2011, part 3, pp. 770–783
[21] Yi, C., Tian, Y.: 'Text string detection from natural scenes by structure-based partition and grouping', IEEE Trans. Image Process., 2011, 20, (9), pp. 2594–2605
[22] Yang, H.-Y., Li, Y.-W., Li, W.-Y., et al.: 'Content-based image retrieval using local visual attention feature', J. Vis. Commun. Image Represent., 2014, 25, (6), pp. 1308–1323
[23] Wang, Y., Cen, Y., Zhao, R., et al.: 'Separable vocabulary and feature fusion for image retrieval based on sparse representation', Neurocomputing, 2017, 236, pp. 14–22
[24] Dimitrovski, I., Kocev, D., Loskovska, S., et al.: 'Improving bag-of-visual-words image retrieval with predictive clustering trees', Inf. Sci. (NY), 2016, 329, pp. 851–865
[25] Li, S., Purushotham, S., Chen, C., et al.: 'Measuring and predicting tag importance for image retrieval', IEEE Trans. Pattern Anal. Mach. Intell., 2017, pp. 1–14
[26] Wu, L., Jin, R., Jain, A.K.: 'Tag completion for image retrieval', IEEE Trans. Pattern Anal. Mach. Intell., 2013, 35, (3), pp. 716–727
[27] Liu, D., Wang, M., Yang, L., et al.: 'Tag quality improvement for social images'. Proc. IEEE Int. Conf. Multimedia and Expo (ICME 2009), 2009, pp. 350–353
[28] Cui, C., Lin, P., Nie, X., et al.: 'Hybrid textual–visual relevance learning for content-based image retrieval', J. Vis. Commun. Image Represent., 2017, 48, pp. 367–374
[29] Neumann, L., Matas, J.: 'Real-time scene text localization and recognition'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 3538–3545
[30] Shi, C., Wang, C., Xiao, B., et al.: 'Scene text detection using graph model built upon maximally stable extremal regions', Pattern Recognit. Lett., 2013, 34, (2), pp. 107–116
[31] Felhi, M., Bonnier, N., Tabbone, S.: 'A skeleton based descriptor for detecting text in real scene images'. 21st Int. Conf. Pattern Recognition (ICPR), 2012, pp. 282–285
[32] Matas, J., Chum, O., Urban, M., et al.: 'Robust wide-baseline stereo from maximally stable extremal regions', Image Vis. Comput., 2004, 22, (10), pp. 761–767
[33] Bergasa, L.M., Yebes, J.J.: 'Text location in complex images'. Int. Conf. Pattern Recognition (ICPR), 2012, pp. 617–620
[34] Chen, H., Tsai, S.S., Schroth, G., et al.: 'Robust text detection in natural images with edge-enhanced maximally stable extremal regions'. 18th IEEE Int. Conf. Image Processing (ICIP), 2011, pp. 3–6
[35] Li, Y., Lu, H.: 'Scene text detection via stroke width'. Proc. Int. Conf. Pattern Recognition, 2012, pp. 681–684
[36] Jung, C., Liu, Q., Kim, J.: 'A stroke filter and its application to text localization', Pattern Recognit. Lett., 2009, 30, (2), pp. 114–122
[37] Lucas, S.M., Panaretos, A., Sosa, L., et al.: 'ICDAR 2003 robust reading competitions'. Proc. Int. Conf. Document Analysis and Recognition (ICDAR), 2003, pp. 682–687
[38] Kim, Y., Jernite, Y., Sontag, D., et al.: 'Character-aware neural language models'. Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 2741–2749
[39] Pan, Y.F., Liu, C.L., Hou, X.: 'Fast scene text localization by learning-based filtering and verification'. Proc. Int. Conf. Image Processing (ICIP), 2010, pp. 2269–2272
[40] Kai, W., Babenko, B., Belongie, S.: 'End-to-end scene text recognition'. IEEE Int. Conf. Computer Vision (ICCV), 2011, pp. 1457–1464
[41] Wang, T., Wu, D.J., Coates, A., et al.: 'End-to-end text recognition with convolutional neural networks'. 21st Int. Conf. Pattern Recognition, 2012, pp. 3304–3308
[42] Liu, G.-H., Yang, J.-Y., Li, Z.: 'Content-based image retrieval using computational visual attention model', Pattern Recognit., 2015, 48, (8), pp. 2554–2566