IET Image Processing
Research Article
Detected text-based image retrieval approach
for textual images
ISSN 1751-9659
Received on 9th April 2018
Revised 22nd November 2018
Accepted on 3rd December 2018
E-First on 31st January 2019
doi: 10.1049/iet-ipr.2018.5277
www.ietdl.org
Salahuddin Unar1, Xingyuan Wang1,2, Chuan Zhang1, Chunpeng Wang3
1 Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, People's Republic of China
2 School of Information Science and Technology, Dalian Maritime University, Dalian 116026, People's Republic of China
3 School of Information, Qilu University of Technology, Shandong 250353, People's Republic of China
E-mail: wangxy@dlut.edu.cn
Abstract: This work addresses the problem of searching and retrieving similar textual images based on the detected text and opens new directions for textual image retrieval. For image retrieval, several methods have been proposed to extract visual features and social tags; however, extracting embedded and scene text within images and using that text as automatic keywords/tags is still a young research field for text-based and content-based image retrieval applications. Automatic text detection and retrieval is an emerging technology for robotics and artificial intelligence. In this study, the authors propose a novel approach to detect the text in an image and exploit it as keywords and tags for automatic text-based image retrieval. First, text regions are detected using the maximally stable extremal region algorithm. Second, unwanted false-positive text regions are eliminated based on geometric properties and the stroke width transform. Next, the true text regions are passed to optical character recognition. Third, keywords are formed using a neural probabilistic language model. Finally, the textual images are indexed and retrieved based on the detected keywords. The experimental results on two benchmark datasets show that the detected text is an efficient and valuable cue for image retrieval, specifically for textual images.
1 Introduction
With recent advances in information technology and digital media, capturing and sharing information (i.e. images, video, and audio) has increased significantly. Efficient methods are therefore needed to retrieve such information, which now exists in excessive amounts. For this purpose, content-based image retrieval (CBIR) has acted as a backbone of the multimedia and computer vision communities for the last two decades [1–3]. In CBIR, for a given query image, the system retrieves a number of similar images from the database and presents them to the user. The resulting images can be similar to the query image in the sense of colour, shape, and texture of objects within the image under varying conditions and complex backgrounds.
Image retrieval is a wide research field that draws on methods from information retrieval, machine learning, multimedia research, computer vision, and human–computer interaction. Image retrieval methods can be classified into two groups: text-based image retrieval (TBIR) and CBIR. TBIR methods need some heuristic information in textual form (i.e. image descriptions and tags) for each image; indexing and retrieval are then performed through textual queries. Such methods are sufficient for a limited number of database images with precise tags and descriptions. However, their limitation is that they require a huge amount of human labour to manually annotate each image. Nowadays, images exist in the millions, and it is almost impossible to annotate each image manually. To overcome such limitations, CBIR methods have been introduced. CBIR methods describe images by their visual contents (i.e. colour, shape, and texture) and depend heavily on analysing image descriptors and similarity measurement.
For a robust CBIR system, the main purpose is to achieve higher accuracy with minimum computation time. To boost retrieval performance, several methods have been introduced to retrieve similar images from the database [4–6]. However, researchers have not yet standardised any ideal approach, and it remains a challenging problem. Owing to the growth of image data, simple features such as colour, shape, and texture are no longer sufficient to interpret the image efficiently. Existing methods have mostly focused on extracting visual features such as colour, texture, and shape, and on fusing multiple visual descriptors [7–9]. Indexing of similar images is achieved based on these visual features. However, no standard method has yet been proposed for retrieving textual images.
With the increasing usage of social media sites (e.g. Instagram, Flickr, and Facebook), millions of people share their pictures every day. Many of these pictures contain textual information, which is an additional and clearer clue to perceiving the image. It is a common practice to edit pictures by writing inspirational and motivational quotes on them, as shown in Fig. 1b. Sometimes, pictures captured in natural scene environments also contain textual data against complex backgrounds, as shown in Fig. 1a. Therefore, the text embedded in images can provide useful information for automatic tagging, annotation, and indexing. Exploiting such information makes it possible to retrieve textual images similar to the query image. Consequently, to improve the retrieval accuracy of TBIR and CBIR for textual images, the detected text can be an enormous asset for perceiving the image more deeply.
The automatic extraction of textual contents is a challenging yet valuable task for several computer vision-based applications, for example, helping a blind person read the contents of an image or helping a tourist translate them. Retrieving textual contents can also be greatly useful for robots to perform specific actions.
In recent years, the problem of text detection and localisation in images has gained much attention [10–14]. Several methods have been proposed to detect text in images. However, their core objective is to detect and localise the text only; they do not consider the detected text for retrieving similar images. Some of the well-known methods for text detection are summarised in Table 1.
Most state-of-the-art CBIR methods explore visual features. Sometimes the visual features are fused together to achieve higher accuracy, and image indexing is performed based on these visual features. In [7], Liu et al. proposed a colour information feature (CIF) and combined it with a local binary pattern (LBP)-based feature, since LBP-based features are sometimes poor at capturing rich colour information. CIF is capable of describing colour
information, image brightness, and colour distribution. Walia and Pal [8] proposed a fusion framework that combines low-level features by employing the colour difference histogram and the angular radial transform. Yang et al. [22] presented an approach based on a salient point detector and salient point expansion using local visual features; the salient points are obtained with the speeded-up robust features detector. To cope with large visual vocabularies, Wang et al. [23] proposed a hierarchy of medium-sized vocabularies, with sparse representation adapted to select a specific vocabulary. In [24], Dimitrovski et al. employed predictive clustering trees to construct an indexing structure for codebook construction; such codebooks can efficiently increase the discriminative power of the dictionary. So far, these methods have been developed for visual images (i.e. images containing colourful objects) and cannot perform well for textual images.
Moreover, several methods have been proposed to retrieve images based on social tags and keywords. In [25], Li et al. proposed a model that extracts visual, contextual, and semantic features to identify objects and predict the importance of scene tags. However, these tags are sometimes inaccurate and reflect only emotion. Wu et al. [26] introduced a new method for incomplete and missing social tags, using a tag matrix for the image–tag relation that exploits observed tags and visual similarity. Liu et al. [27] proposed a novel approach for improving improper social tags; their approach enforces the consistency of visual and semantic similarities along with the social tags before and after improvement. Some authors introduced hybrid visual–textual relevance learning methods. Cui et al. [28] proposed a method based on textual–visual relevance learning that extracts text from image tags and associates it with visual features. So far, these methods have used visual features and social tags for indexing and retrieving similar images, and no standard method has yet been proposed to detect text automatically and retrieve images based on the detected text.
In this paper, we propose a novel approach to retrieve similar textual images by detecting embedded and scene text in textual images. In particular, we use a text detection technique and employ the detected text as keywords and tags for indexing and retrieving the textual images. The key contributions of this work are as follows:
• To the best of our knowledge, there has been no standard method for textual image retrieval. This work is one of the few investigations on indexing and retrieving textual images effectively.
• The proposed method is innovative in dealing with textual images that contain text within the image (e.g. quote images, scene images, individual video frames).
• A fully automatic TBIR method is proposed to retrieve similar textual images based on the text detected in complex background images. The detected text is employed as keywords/tags for indexing and retrieval.
• The method is robust and efficient in retrieving similar textual images for a given textual image query, and is based on an easy-to-use framework.
• A new dataset of 1000 images covering 20 categories is introduced. The dataset includes quote images, natural scene images, Twitter snapshots, TV channel video frames, and other textual images.
The rest of this paper is organised as follows. Section 2 describes the proposed method. Section 3 presents the different similarity distance measurements. In Section 4, a new dataset is introduced and the experimental results are evaluated. Section 5 concludes the paper and outlines future directions.
2 Proposed method
In this section, we introduce a novel approach that detects text and employs that text to index and retrieve similar textual images. First, candidate text regions are detected using the maximally stable extremal region (MSER) algorithm. After applying MSER, several non-text regions may still exist; to remove them, we apply some geometric properties. Further filtration of non-text regions is carried out using the stroke width transform (SWT). After obtaining positive text regions, we apply bounding boxes to form text lines from the textual components. Once the text is localised and detected, it is fed into optical character recognition (OCR). A neural probabilistic language model (LM) is employed to form individual keywords from the recognised text. Four distance/similarity measures, namely Euclidean distance, Canberra distance, Manhattan distance, and cosine similarity, are used to compute the similarity between the query image and the database images. Finally, the top-ranked images are retrieved based on the distance computation. A schematic illustration of the proposed approach is shown in Fig. 2.
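The following is a minimal Python sketch of this pipeline (not the authors' implementation), using OpenCV and pytesseract; the helper names filter_geometric, filter_stroke_width, and group_into_words are hypothetical stubs standing in for the stages detailed in Sections 2.2–2.4.

```python
import cv2
import pytesseract

def retrieve_keywords(image_path):
    """Detect text in a textual image and return word-level keywords for indexing."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Step 1: candidate text regions via MSER (Section 2.1).
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)

    # Steps 2-4 (hypothetical helpers): prune non-text regions and group the
    # survivors into word bounding boxes.
    # bboxes = filter_geometric(bboxes)
    # bboxes = filter_stroke_width(gray, bboxes)
    # words  = group_into_words(bboxes)

    # Step 5: recognise text with the Tesseract OCR engine and keep word tokens
    # as keywords/tags (Section 2.5 additionally scores them with a language model).
    text = pytesseract.image_to_string(gray)
    return [w.lower() for w in text.split() if any(c.isalnum() for c in w)]

keywords = retrieve_keywords("query_image.jpg")   # hypothetical query image
print(keywords)
```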
Fig. 1  Sample textual images from datasets
(a) ICDAR 2003 dataset, (b) Sindh dataset
Table 1 State-of-the-art methods for text detection and localisation in natural scene images
Method | Precision | Recall | F value | Features | Determination | Datasets
Ezaki et al. [15] | 60 | 64 | 62 | connected component based | text detection in natural scene images | ICDAR 2003
Zhou et al. [16] | 37 | 88 | 53 | texture based | text localisation and classification | ICDAR 2003
Zhang and Kasturi [17] | 67 | 46 | — | edge based | text edge detection and extraction | ICDAR 2003
Epshtein et al. [18] | 73 | 60 | 66 | stroke based | SWTs | ICDAR 2003, ICDAR 2005
Ma et al. [19] | 67 | 72 | — | edge and CC based | component analysis and edge detection | ICDAR 2003
Neumann and Matas [20] | 59 | 55 | 57 | texture and edge based | text localisation using MSER | ICDAR 2003, Chars75K
Yi and Tian [21] | 71 | 62 | 62 | connected components based | text detection in natural scene images | ICDAR 2003, OSTD
ICDAR: International Conference on Document Analysis and Recognition; OSTD: oriented scene text dataset
2.1 Character candidate extraction
MSER has been identified as one of the best region detectors owing to its robustness against scale, viewpoint, and lighting changes. Several methods have adapted MSER to extract character candidates and achieved satisfactory results [29–31]. The main advantage of the MSER algorithm over other traditional methods is that it can detect most of the textual components even in low-quality images. Generally, text has distinct contrast and appearance against its complex background and a comparatively uniform colour intensity; hence, MSER is a suitable choice. The proposed method employs MSER to extract character candidate regions [32].
Let $p_1, p_2, p_3, \ldots, p_i$ be a sequence of nested extremal regions, that is, $p_i \subset p_{i+1}$. The stability of region $p_i$ is measured as

$$v(i) = \frac{|p_{i+\Delta}| - |p_i|}{|p_i|} \qquad (1)$$

where $\Delta$ is a parameter. If $v(i)$ has a local minimum at $i$, then $p_i$ is an MSER. The text regions obtained after applying the MSER filter are shown in Fig. 3b.
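A minimal OpenCV sketch of this step is shown below (assumed implementation, not the authors' code); the delta parameter plays the role of Δ in (1) and the input file name is hypothetical.

```python
import cv2

img = cv2.imread("quote_image.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input file
mser = cv2.MSER_create()
mser.setDelta(5)                      # stability parameter Δ
regions, bboxes = mser.detectRegions(img)

# Each entry in 'regions' is an array of pixel coordinates of one extremal region;
# 'bboxes' gives the corresponding (x, y, w, h) boxes used in the later filtering stages.
print(f"{len(regions)} candidate regions detected")
```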
2.2 Non-text objects filtering
After applying MSER, many non-text objects may still exist. We apply simple geometric properties such as width, height, and aspect ratio to filter out obvious non-text objects. The objects with extreme maximum and minimum variations are eliminated first. There are numerous geometric properties suited to distinguishing text from non-text objects; the proposed method uses some of the geometric properties described in [33, 34] to eliminate non-text objects.
Aspect ratio: The aspect ratio is given as

$$\text{Aspect ratio} = \frac{\max(\text{width}, \text{height})}{\min(\text{width}, \text{height})} \qquad (2)$$

We limit the aspect ratio of character candidates to between 0.1 and 10. As some characters are very similar, such as ‘0’ and ‘O’, or ‘i’ and ‘l’, we merge them into one category.
Eccentricity: It is the ratio of the distance between the foci of the ellipse to its major axis length. It returns a scalar that specifies the eccentricity of the ellipse that has the same second moments as the region. An ellipse with eccentricity 0 is a circle, and one with eccentricity 1 is a line segment. Candidates with eccentricity >0.995, which are effectively line segments, are treated as non-text.
Extent: It returns a scalar that represents the ratio of pixels in the region to pixels in the total bounding box, i.e. the region area divided by the area of the bounding box. We keep candidates whose extent lies between 0.2 and 0.9.
Solidity: It returns a scalar that specifies the proportion of the pixels in the convex hull that are also in the region, i.e. the region area divided by the convex area. Candidates with solidity <0.3 are treated as non-text.
Euler number: It returns a scalar that specifies the number of objects in a region minus the number of holes in those objects, computed using 8-connectivity. Candidates with Euler number <−4 are treated as non-text.
Size: Character candidates smaller than 5 px are rejected, as they contain very little information and only add to the computation time.
Most of the obvious non-text objects are removed after applying the above geometric properties, as shown in Fig. 4a. Once these conditions are satisfied, a character candidate is passed to the next step.
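A hedged sketch of this geometric filtering stage using skimage regionprops is given below; the threshold values follow the text, while the keep/reject direction of each test is an interpretation rather than a statement of the authors' exact code.

```python
from skimage.measure import label, regionprops

def filter_candidates(binary_mask):
    """Keep connected components whose geometry is plausible for characters."""
    kept = []
    for r in regionprops(label(binary_mask)):
        h, w = r.bbox[2] - r.bbox[0], r.bbox[3] - r.bbox[1]
        aspect = max(w, h) / max(min(w, h), 1)          # eq. (2)
        if not (0.1 <= aspect <= 10):                   # aspect ratio limit
            continue
        if r.eccentricity > 0.995:                      # nearly a line segment
            continue
        if not (0.2 <= r.extent <= 0.9):                # region area / bbox area
            continue
        if r.solidity < 0.3:                            # region area / convex area
            continue
        if r.euler_number < -4:                         # too many holes
            continue
        if r.area < 5:                                  # tiny regions rejected
            continue
        kept.append(r)
    return kept
```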
2.3 Stroke width filtering
Geometric properties may not fully eliminate the non-text objects. Another common cue used to distinguish text from non-text objects is stroke width. Stroke width can be defined as the length of a straight line from a text pixel to another pixel along its gradient direction [35]. Several methods have adopted stroke width for false-positive elimination, as it computes the width of the curves and lines that can form a character [18, 36].
Fig. 2  Schematic illustration of the proposed method
Fig. 3  Textual regions extraction
(a) Original image, (b) Detected MSER regions
Fig. 4  Text detection and localisation
(a) Geometric-based non-text objects filtering, (b) SWT-based non-text objects
filtering
Text regions tend to have little stroke width variation, while non-text regions tend to have more variation. The proposed method follows the SWT [18] to further eliminate false positives. SWT is a local image operator that computes, per pixel, the width of the most likely stroke containing that pixel. The output image has the same size as the input image, and each element contains the width of the stroke associated with the corresponding pixel.
First, the initial value of each element of the SWT is set to ∞. The gradient direction dp of each edge pixel p is computed. If p lies on a stroke boundary, then dp is roughly perpendicular to the orientation of the stroke. The ray r = p + n dp, n > 0, is followed until another edge pixel q is found, and the gradient direction dq at pixel q is considered. If dq is roughly opposite to dp (dq = −dp ± π/6), then each element of the SWT output image along the segment [p, q] is assigned the width ‖p − q‖ unless it already has a lower value. The ray is discarded if q is not found or if dq is not opposite to dp.
We filter out connected components based on the ratio of the stroke width standard deviation (std) to the stroke width mean, removing components with std/mean > 0.5, a threshold obtained from the ICDAR benchmark [37]. Once the false positives are removed, the true components are fed to the next step for text line formation and word grouping. The positive text regions obtained are shown in Fig. 4b.
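A hedged sketch of the stroke-width-variation check follows. Instead of a full SWT implementation, stroke width is approximated per component from the distance transform (twice the distance to the nearest background pixel), which is a common simplification; the std/mean > 0.5 rejection threshold follows the text.

```python
import cv2

def has_text_like_stroke(component_mask, max_ratio=0.5):
    """component_mask: uint8 binary mask (255 = region pixels) of one candidate."""
    dist = cv2.distanceTransform(component_mask, cv2.DIST_L2, 5)
    # Approximate per-pixel stroke widths inside the component.
    widths = 2.0 * dist[component_mask > 0]
    widths = widths[widths > 0]
    if widths.size == 0:
        return False
    return (widths.std() / widths.mean()) <= max_ratio
```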
2.4 Text line formation
Adjacent character components are grouped together to form straight lines. To detect these lines, distinct textual components need to be merged into meaningful words. Character candidates that belong to the same text line are expected to have similar properties (i.e. stroke width, height, size, and intensity). First, the midpoints of the connected components are computed, and the Euclidean distance D and orientation angle θ between each pair of connected components are measured. As a result, two maps are obtained: a distance map and an orientation map. If D < MaxDistance, the two connected components are assumed to be adjacent characters, where MaxDistance is the maximum allowed Euclidean distance between connected components. Assuming that text is generally found in a horizontal orientation, we restrict θ to between −45° and 45°. Each component pair satisfying this rule is then checked against the similarity criteria described in [35]. The components satisfying the following criteria are processed further:
$$
\begin{aligned}
& w_i + w_j > 1.3 \times D \\
& \max(w_i/w_j,\; w_j/w_i) < 5 \\
& \max(h_i/h_j,\; h_j/h_i) < 2.5 \\
& \max(s_i/s_j,\; s_j/s_i) < 1.75 \\
& \max(n_i/n_j,\; n_j/n_i) < 1.75
\end{aligned}
$$
where $w_i$, $h_i$, $s_i$, and $n_i$ denote the width, height, mean stroke width, and intensity of the ith bounding box, respectively. The threshold values can be adjusted according to experimentation. If a line contains at least three textual objects, it is declared a text line. The process ends when no more components can be merged. Connected components satisfying the above conditions are grouped together, and the remaining components are assumed to be false positives and eliminated. The formed text lines are shown in Fig. 5a.
Furthermore, the formed text lines are split into individual words for recognition. We compute the overlap ratio between all bounding box pairs by measuring the distance between the textual component pairs. The proposed method finds non-zero overlap ratios to locate groups of neighbouring text regions. A threshold T is given as

$$T = \operatorname{mean}(D) + \alpha \times \operatorname{std}(D) \qquad (3)$$

where D is the distance vector that specifies the horizontal distance between components. If the distance between two components exceeds the threshold, they are considered to belong to different words and are separated. We set the value of α to 1.5, and bounding boxes are applied to each word individually. The applied bounding boxes are shown in Fig. 5b.
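The sketch below illustrates, under stated assumptions, the pairwise grouping test of this section and the word-split threshold of (3); w/h/s/n are the width, height, mean stroke width, and intensity of each component's bounding box, assumed to be pre-computed.

```python
import numpy as np

def are_adjacent(ci, cj, distance, max_distance, angle_deg):
    """Decide whether two components belong to the same text line."""
    if distance >= max_distance or not (-45 <= angle_deg <= 45):
        return False
    return (ci["w"] + cj["w"] > 1.3 * distance
            and max(ci["w"] / cj["w"], cj["w"] / ci["w"]) < 5
            and max(ci["h"] / cj["h"], cj["h"] / ci["h"]) < 2.5
            and max(ci["s"] / cj["s"], cj["s"] / ci["s"]) < 1.75
            and max(ci["n"] / cj["n"], cj["n"] / ci["n"]) < 1.75)

def word_split_threshold(horizontal_gaps, alpha=1.5):
    """Gaps larger than T = mean(D) + alpha * std(D) separate two words (eq. 3)."""
    d = np.asarray(horizontal_gaps, dtype=float)
    return d.mean() + alpha * d.std()
```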
2.5 Text recognition and keywords formation
The true text regions detected in Section 2.4 are fed into an OCR engine. Several commercial and open-source OCR tools are freely available; we adopted Google's open-source Tesseract OCR engine [https://opensource.google.com/projects/tesseract] for text recognition. The proposed method employs the recognised text words as tags and keywords for indexing the images. The natural approach is to retrieve similar images based on their text confidence scores and maximum string match; text words with a high confidence score are retrieved first. If no text is detected in an image, it becomes difficult to retrieve that image; hence, we favour a high recall ratio by allowing more false positives so as to obtain the maximum number of keywords. For keyword formation, the proposed method employs a neural probabilistic LM that relies on character-level inputs while its predictions are still made at the word level [38]. The model is based on a convolutional neural network whose output from a single layer is used as input at time t to a recurrent neural network LM.
Let γ be the vocabulary of recognised keywords, C the vocabulary of characters, d the dimensionality of the character embeddings, and $Q \in \mathbb{R}^{d \times |C|}$ the character embedding matrix. Suppose word $k \in \gamma$ is composed of the character sequence $(c_1, c_2, \ldots, c_l)$, where l is the length of word k. The character-level representation of word k is then given by the matrix $C^k \in \mathbb{R}^{d \times l}$, whose jth column corresponds to character $c_j$. A narrow convolution is applied between $C^k$ and a filter kernel $H \in \mathbb{R}^{d \times w}$ of width w.
Then, a bias is added and a non-linearity is applied to obtain the feature map $f^k \in \mathbb{R}^{l - w + 1}$. The ith component of the feature map $f^k$ is given as

$$f^k[i] = \tanh\bigl(\langle C^k[\ast,\, i:i+w-1], H\rangle + b\bigr) \qquad (4)$$

where $C^k[\ast,\, i:i+w-1]$ denotes the ith to (i + w − 1)th columns of $C^k$ and $\langle A, B\rangle = \operatorname{Tr}(AB^{\mathrm{T}})$ is the Frobenius inner product. The feature corresponding to filter H for word k is obtained by max-over-time pooling:

$$y^k = \max_i f^k[i] \qquad (5)$$

For a given filter, the basic approach is to capture the string with the maximum score. The network applies several filters of varying width w to obtain the feature vector for each word k. For a total of h filters $H_1, H_2, \ldots, H_h$, the input representation of word k is $\mathbf{y}^k = [y_1^k, y_2^k, \ldots, y_h^k]$.
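A hedged numpy sketch of (4) and (5) follows: a narrow character-level convolution followed by max-over-time pooling, as in the character-aware LM of [38]. The dimensions and the tanh/Frobenius-product form follow the equations above; the random initialisation is purely illustrative.

```python
import numpy as np

d, l, w = 15, 8, 3                      # char embedding dim, word length, filter width
Ck = np.random.randn(d, l)              # character matrix C^k for one word
H  = np.random.randn(d, w)              # one convolution filter of width w
b  = 0.1                                # bias

# Eq. (4): f^k[i] = tanh(<C^k[:, i:i+w-1], H> + b), Frobenius inner product.
f = np.array([np.tanh(np.sum(Ck[:, i:i + w] * H) + b) for i in range(l - w + 1)])

# Eq. (5): max-over-time pooling gives one feature y^k per filter.
y = f.max()
print(f.shape, y)    # (l - w + 1,) and a scalar feature for this filter
```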
Fig. 5  Text formation
(a) Text line formation, (b) Keyword formation
3 Similarity measure
For an accurate image retrieval system, both feature extraction and similarity measurement play an important role. Even when feature extraction goes smoothly, a poorly chosen similarity measure leads to noisy results. The proposed method supports two modes of operation: exact substring match and approximate substring match. Exact substring match retrieves images whose recognised keywords contain exactly matching words, and approximate substring match retrieves images with the closest matching strings. Generally, exact substring match has higher priority than approximate match, and its results are retrieved first. To compute the similarity distance between the detected text, we use the Euclidean distance, Canberra distance, Manhattan distance, and cosine similarity.
The feature vector of the ith database image is given as $F_{DB_i} = \{w_1, w_2, \ldots, w_N\}$, where N is the number of recognised keywords in that image. The feature vector of the query image q is given as $F_q = \{w_1, w_2, \ldots, w_N\}$, where N is the number of recognised keywords in q. The main idea is to select from the database the images with the maximum number of strings matching the query image. The distance measures are given as follows.
Euclidean distance:

$$D(F_{DB_i}, F_q) = \left(\sum_{i=1}^{N} (F_{DB_i} - F_q)^2\right)^{1/2} \qquad (6)$$

Canberra distance:

$$D(F_{DB_i}, F_q) = \sum_{i=1}^{N} \frac{|F_{DB_i} - F_q|}{|F_{DB_i}| + |F_q|} \qquad (7)$$

Manhattan distance:

$$D(F_{DB_i}, F_q) = \sum_{i=1}^{N} |F_{DB_i} - F_q| \qquad (8)$$

Cosine similarity:

$$D(F_{DB_i}, F_q) = \frac{F_{DB_i} \cdot F_q}{\lVert F_{DB_i} \rVert \, \lVert F_q \rVert} \qquad (9)$$

where $F_{DB_i}$ is the feature vector of a database image and $F_q$ is the feature vector of the query image.
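A minimal sketch of the four measures (6)–(9) is given below, assuming the keyword feature vectors have already been mapped to equal-length numeric numpy vectors (e.g. term-frequency vectors over a shared keyword vocabulary).

```python
import numpy as np

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))             # eq. (6)

def canberra(a, b):
    denom = np.abs(a) + np.abs(b)
    denom[denom == 0] = 1.0                                  # avoid division by zero
    return float(np.sum(np.abs(a - b) / denom))              # eq. (7)

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))                      # eq. (8)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # eq. (9)
```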
4 Experimental results and discussion
In this section, we present the experimental results and performance evaluation. All experiments are implemented and executed on a computer with an Intel Core i5-2100 central processing unit running at 3.10 GHz and 8 GB of random access memory.
4.1 Datasets
The experiments are conducted on two benchmark datasets to ensure the accuracy and robustness of the proposed approach. Both datasets contain textual images. The datasets are as follows:
ICDAR 2003: The dataset [37] contains 500 natural scene images with resolutions varying from 640 × 480 to 1600 × 1200, of which 251 images belong to the TrialTrain set and 249 images to the TrialTest set. The images were captured indoors and outdoors under varying conditions (i.e. text size, font, colour, illumination, and position). The text appears on signboards, banners, posters, and other objects.
Sindh: We propose a new dataset, namely Sindh, that contains a total of 1000 images, including quotation images, Twitter snapshots, natural scene images, TV news channel video frames, and other textual images. The resolution of the images varies from 320 × 240 to 1920 × 1440, and the images were collected randomly from Google, Instagram, and Twitter. We divided these images into 20 different groups.
4.2 Retrieval performance protocol
The performance of image retrieval can be computed using the mean average precision (mAP), which is the average over all image queries. The precision of the top-ranked images is given as

$$P(R_k) = \frac{|\{\text{relevant images}\} \cap \{\text{retrieved images}\}|}{|\{\text{retrieved images}\}|} \qquad (10)$$

where $R_k$ is the set of top retrieved images; we set k = 10, since users are more concerned with top-ranked results. The AP value for a single query is the average of the precision values obtained over the set of k retrieved images. The AP values are then averaged over all queries. Given the set of relevant images for a query $q_i \in Q$ as $\{I_1, \ldots, I_m\}$, where Q is the set of all queries, the mAP is given as

$$\mathrm{mAP}(Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{m} \sum_{k=1}^{m} P(R_k) \qquad (11)$$
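The following is a hedged sketch of one reading of the evaluation protocol of (10)–(11): precision at the top-k results averaged per query, then averaged over all queries (k = 10 in the paper).

```python
def precision_at_k(retrieved, relevant, k=10):
    top = retrieved[:k]
    return sum(1 for img in top if img in relevant) / max(len(top), 1)    # eq. (10)

def mean_average_precision(queries, k=10):
    """queries: list of (ranked_retrieved_ids, set_of_relevant_ids) pairs."""
    aps = []
    for retrieved, relevant in queries:
        # AP for one query: average precision over the cut-offs 1..k.
        precisions = [precision_at_k(retrieved, relevant, i) for i in range(1, k + 1)]
        aps.append(sum(precisions) / len(precisions))
    return sum(aps) / max(len(aps), 1)                                     # eq. (11)
```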
4.3 Implementation detail
The proposed method detects the text embedded within images and uses it as keywords/tags for retrieving textual images. To evaluate the performance and efficiency of the proposed approach, we first perform experiments for text detection and recognition and compare with state-of-the-art methods. Next, we evaluate the proposed method for detected text-based image retrieval.
4.3.1 Text detection and recognition: In this section, we conduct two experiments: (i) text detection and (ii) end-to-end text recognition.
Experiment I: For an accurate and robust system, text detection is the most important task. For this purpose, we perform text detection evaluation on the benchmark datasets defined in Section 4.1 and compare the results with state-of-the-art methods. We follow the standard evaluation protocols of precision p and recall r stated in [37]. The precision p′, recall r′, and frequency measure f are given as

$$p' = \frac{\sum_{r_e \in E} m(r_e, T)}{|E|} \qquad (12)$$

$$r' = \frac{\sum_{r_t \in T} m(r_t, E)}{|T|} \qquad (13)$$

$$f = \frac{1}{(\alpha / p') + ((1 - \alpha)/r')} \qquad (14)$$

where E is the set of estimated words and T is the set of ground-truth targets. The frequency measure f combines precision and recall, with their relative weights controlled by α. All performance measures are computed for each image, and the average is reported as the performance of the proposed approach. For the ICDAR 2003 dataset, the proposed approach achieved 74% precision and 68% recall. For the Sindh dataset, it achieved 75% precision and 70% recall. The results demonstrate that the proposed approach outperformed state-of-the-art methods in precision and f value on the ICDAR'03 dataset. For the Sindh dataset, the accuracy is lower owing to the high complexity of its different image categories. The obtained results are given in Tables 2 and 3 for the ICDAR 2003 and Sindh datasets, respectively.
Experiment II: We evaluated the performance of end-to-end word recognition on the ICDAR 2003 and Sindh datasets. There are two metrics for recognition performance: normalised edit distance and word-level recognition. The former is an outdated metric, as it tolerates partial local errors within each word. We use the latter metric, which is quite strict and requires each character to be recognised correctly. For word recognition, we again follow the
recognition evaluation protocol defined in [37]. Precision p is the ratio of the number of words recognised correctly to the total number of words recognised by the system. Recall r is the ratio of the number of words recognised correctly to the total number of words localised and detected. If a bounding box overlaps a ground-truth bounding box by more than 50%, it is counted as a match. Tables 4 and 5 show the performance of different recognition methods evaluated on the ICDAR 2003 and Sindh datasets. The performance of the proposed method is computed by the word-level recognition rate, which is commonly used for fair comparison.
4.3.2 Image retrieval performance: To assess the retrieval accuracy of the proposed approach, similarity measures are computed and experiments are conducted on the ICDAR 2003 and Sindh datasets. From the ICDAR 2003 dataset, we randomly select 100 images and use them as query images. The Sindh dataset is divided into 20 categories; we randomly select 10 images from each category and use them as query images, giving a total of 200 queries. We compute the precision and recall ratios for each image in the database. Precision p is the ratio of the number of retrieved relevant images to the total number of retrieved images. Recall r is the ratio of the number of retrieved relevant images to the total number of relevant images in the dataset; p reflects the accuracy of the retrieval system and r its robustness. We consider the top retrieved image to be the one with the maximum number of similar words.
In this experiment, when the user provides a textual query image, the system automatically detects the text and uses it to index the images. If an image does not contain any text, the system adds an auxiliary value of ‘1’. Two operations are performed: exact substring match and approximate substring match. Exact substring match retrieves images having exactly the same words as the query image, and approximate substring match retrieves images with the closest confidence score to the query image. Exact substring match has higher priority than approximate match until k images have been retrieved. We compare image retrieval on the datasets defined in Section 4.1 against Liu's method [42], which is a purely visual image retrieval method. Table 6 shows the retrieval accuracy obtained by the proposed method. The results demonstrate that Liu's method does not perform well for textual images, whereas the proposed method performs well specifically on textual images.
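The sketch below illustrates, under stated assumptions, the two retrieval modes described above: exact keyword matches are ranked first, and approximate (closest-string) matches fill the remaining top-k slots. SequenceMatcher is used purely as an illustrative approximate string scorer, not as the authors' scoring function.

```python
from difflib import SequenceMatcher

def rank_images(query_keywords, database, k=10):
    """database: dict mapping image_id -> list of recognised keywords."""
    query = {w.lower() for w in query_keywords}
    scored = []
    for image_id, keywords in database.items():
        kws = {w.lower() for w in keywords}
        exact = len(query & kws)                       # exact substring matches
        approx = max((SequenceMatcher(None, q, w).ratio()
                      for q in query for w in kws), default=0.0)
        # Exact matches dominate the ranking; the approximate score breaks ties.
        scored.append((exact, approx, image_id))
    scored.sort(reverse=True)
    return [image_id for _, _, image_id in scored[:k]]
```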
4.4 Retrieval time complexity
For image retrieval, minimum computation and retrieval times are crucial factors. Computation time and feature selection trade off against each other: extracting additional features leads to higher time consumption. The proposed method finds a good balance between text detection and image retrieval. The time complexity of the proposed approach is given in Table 7 for both benchmarks. The results demonstrate that the proposed method outperformed Liu's method on the ICDAR 2003 dataset. However, the computation time of the proposed method on the Sindh dataset is slightly higher than that of Liu's method owing to the highly complex backgrounds and the large number of small fonts; the Sindh dataset contains several images with small fonts, which are complicated to process within the average time.
4.5 AP at different distances
For an accurate retrieval system, the choice of distance measure is also a crucial factor, as different distance measures can lead to different retrieval results. We computed the four similarity measures defined in Section 3 to assess the retrieval accuracy of the proposed method. Fig. 6 shows the accuracy at different distance measures for the top k images. The results show that the Euclidean distance performs well compared with the other distances.
5 Conclusion
In this paper, we have investigated an effective image retrieval method for textual images based on embedded and scene text. First, the proposed method detects candidate text regions using the MSER algorithm. Non-text regions are then eliminated using geometric properties and the SWT. The remaining connected components are grouped together using bounding boxes. The detected and localised text regions are fed into an OCR engine for text recognition. Keywords are formed using a neural probabilistic LM for image retrieval. Finally, the textual images are indexed and retrieved based on the detected keywords using four different distance measures. To validate the proposed method on embedded and scene text images, we have introduced a
Table 2 Performance comparison of text detection on ICDAR 2003 dataset
Method | Precision | Recall | f
Neumann and Matas [20] | 0.59 | 0.55 | 0.57
Li and Lu [35] | 0.59 | 0.59 | 0.59
Pan et al. [39] | 0.66 | 0.70 | 0.68
Chen et al. [34] | 0.73 | 0.60 | 0.66
Proposed method | 0.74 | 0.68 | 0.70

Table 3 Performance evaluation of text detection on Sindh dataset
Method | Precision | Recall | f
Proposed method | 0.75 | 0.70 | 0.72

Table 4 Performance comparison for end-to-end text recognition on ICDAR 2003
Method | IC03-50 | IC03-full
Kai et al. [40] | 0.55 | 0.56
Wang et al. [41] | 0.72 | 0.67
Proposed method | 0.73 | 0.69

Table 5 Performance evaluation for end-to-end text recognition on Sindh
Method | Sindh-50 | Sindh-full
Proposed method | 0.67 | 0.73

Table 6 Retrieval performance evaluation (mAP) on ICDAR 2003 and Sindh
Dataset/method | Liu's method [42] | Proposed method
ICDAR 2003 | 0.54 | 0.63
Sindh | 0.59 | 0.71

Table 7 Retrieval time complexity (s) on ICDAR 2003 and Sindh
Dataset/method | Liu's method [42] | Proposed method
ICDAR 2003 | 3.46 | 3.17
Sindh | 4.38 | 4.51

Fig. 6  Performance of top-ranked retrieval images on (a) ICDAR 2003, (b) Sindh dataset
new dataset containing quote images, Twitter snapshots, natural scene images, TV news channel video frames, and other textual images. The experimental results on two benchmark datasets show the effectiveness of the proposed approach for textual images. The method is robust against varying text properties such as font, size, colour, illumination, and orientation in complex backgrounds. In the future, we intend to improve the proposed approach by fusing textual features with visual features for a more accurate and efficient retrieval method.
6 Acknowledgments
This research was supported by the National Natural Science Foundation of China (Nos. 61672124 and 61370145) and the Password Theory Project of the 13th Five-Year Plan National Cryptography Development Fund (No. MMJJ20170203).
7 References
[1] Wang, X., Wang, Z.: ‘The method for image retrieval based on multi-factors
correlation utilizing block truncation coding’, Pattern Recognit., 2014, 47,
(10), pp. 3293–3303
[2] Fadaei, S., Amirfattahi, R., Ahmadzadeh, M.R.: ‘New content-based image
retrieval system based on optimised integration of DCD, wavelet and curvelet
features’, IET Image Process., 2017, 11, (2), pp. 89–98
[3] Alzu'bi, A., Amira, A., Ramzan, N.: ‘Semantic content-based image retrieval:
a comprehensive study’, J. Vis. Commun. Image Represent., 2015, 32, pp. 20–
54
[4] Jiang, F., Hu, H.-M., Zheng, J., et al.: ‘A hierarchal BoW for image retrieval
by enhancing feature salience’, Neurocomputing, 2016, 175, pp. 146–154
[5] ElAdel, A., Ejbali, R., Zaied, M., et al.: ‘A hybrid approach for content-based
image retrieval based on fast beta wavelet network and fuzzy decision support
system’, Mach. Vis. Appl., 2016, 27, (6), pp. 1–19
[6] Feng, L., Wu, J., Liu, S., et al.: ‘Global correlation descriptor: a novel image
representation for image retrieval’, J. Vis. Commun. Image Represent., 2015,
33, pp. 104–114
[7] Liu, P., Guo, J.M., Chamnongthai, K., et al.: ‘Fusion of color histogram and
LBP-based features for texture image retrieval and classification’, Inf. Sci.
(NY), 2017, 390, pp. 95–111
[8] Walia, E., Pal, A.: ‘Fusion framework for effective color image retrieval’, J.
Vis. Commun. Image Represent., 2014, 25, (6), pp. 1335–1348
[9] Wang, X., Wang, Z.: ‘A novel method for image retrieval based on structure
elements’ descriptor’, J. Vis. Commun. Image Represent., 2013, 24, (1), pp.
63–74
[10] Tang, Y., Wu, X.: ‘Scene text detection and segmentation based on cascaded
convolution neural networks’, IEEE Trans. Image Process., 2017, 26, (3), pp.
1509–1520
[11] Wei, Y., Zhang, Z., Shen, W., et al.: ‘Text detection in scene images based on
exhaustive segmentation’, Signal Process. Image Commun., 2017, 50, pp. 1–8
[12] Unar, S., Jalbani, A.H., Shaikh, M., et al.: ‘A study on text detection and
localization techniques for natural scene images’, Int. J. Comput. Sci. Netw.
Secur., 2018, 18, (1), pp. 99–111
[13] Zheng, Y., Li, Q., Liu, J., et al.: ‘A cascaded method for text detection in
natural scene images’, Neurocomputing, 2017, 238, pp. 307–315
[14] Unar, S., Jalbani, A.H., Jawaid, M.M., et al.: ‘Artificial Urdu text detection
and localization from individual video frames’, Mehran Univ. Res. J. Eng.
Technol., 2018, 37, (2), pp. 429–438
[15] Ezaki, N., Bulacu, M., Schomaker, L.: ‘Text detection from natural scene
images: towards a system for visually impaired persons’. 17th Int. Conf.
Pattern Recognition (ICPR), 2004, vol. 2, pp. 683–686
[16] Zhou, G., Liu, Y., Meng, Q., et al.: ‘Detecting multilingual text in natural
scene’. Proc. 2011 First Int. Symp. Access Spaces ISAS 2011, 2011, pp. 116–
120
[17] Zhang, J., Kasturi, R.: ‘Text detection using edge gradient and graph
spectrum’. Proc. Int. Conf. Pattern Recognition, 2010, pp. 3979–3982
[18] Epshtein, B., Ofek, E., Wexler, Y.: ‘Detecting text in natural scenes with
stroke width transform’. Proc. IEEE Computer Society Conf. Computer
Vision Pattern Recognition, 2010, pp. 2963–2970
[19] Ma, L., Wang, C., Xiao, B.: ‘Text detection in natural images based on multi-
scale edge detection and classification’. 2010 Third Int. Congress Image
Signal Processing, 2010, vol. 4, pp. 1961–1965
[20] Neumann, L., Matas, J.: ‘A method for text localization and recognition in
real-world images’. Lecture Notes in Computer Science (including Subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics),
2011 (LNCS, 6494), (PART 3), pp. 770–783
[21] Yi, C., Tian, Y.: ‘Text string detection from natural scenes by structure-based
partition and grouping’, IEEE Trans. Image Process., 2011, 20, (9), pp. 2594–
2605
[22] Yang, H.-Y., Li, Y.-W., Li, W.-Y., et al.: ‘Content-based image retrieval using
local visual attention feature’, J. Vis. Commun. Image Represent., 2014, 25,
(6), pp. 1308–1323
[23] Wang, Y., Cen, Y., Zhao, R., et al.: ‘Separable vocabulary and feature fusion
for image retrieval based on sparse representation’, Neurocomputing, 2017,
236, pp. 14–22
[24] Dimitrovski, I., Kocev, D., Loskovska, S., et al.: ‘Improving bag-of-visual-
words image retrieval with predictive clustering trees’, Inf. Sci. (NY), 2016,
329, pp. 851–865
[25] Li, S., Purushotham, S., Chen, C., et al.: ‘Measuring and predicting tag
importance for image retrieval’, IEEE Trans. Pattern Anal. Mach. Intell.,
2017, 8828, (c), pp. 1–14
[26] Wu, L., Jin, R., Jain, A.K.: ‘Tag completion for image retrieval’, IEEE Trans.
Pattern Anal. Mach. Intell., 2013, 35, (3), pp. 716–727
[27] Liu, D., Wang, M., Yang, L., et al.: ‘Tag quality improvement for social
images’. Proc. 2009 IEEE Int. Conf. Multimedia Expo ICME 2009, 2009, pp.
350–353
[28] Cui, C., Lin, P., Nie, X., et al.: ‘Hybrid textual–visual relevance learning for
content-based image retrieval’, J. Vis. Commun. Image Represent., 2017, 48,
pp. 367–374
[29] Neumann, L., Matas, J.: ‘Real-time scene text localization and recognition’.
Proc. IEEE Computer Society Conf. Computer Vision Pattern Recognition,
2012, pp. 3538–3545
[30] Shi, C., Wang, C., Xiao, B., et al.: ‘Scene text detection using graph model
built upon maximally stable extremal regions’, Pattern Recognit. Lett., 2013,
34, (2), pp. 107–116
[31] Felhi, M., Bonnier, N., Tabbone, S.: ‘A skeleton based descriptor for detecting
text in real scene images’. 2012 21st Int. Conf. Pattern Recognition (ICPR),
2012, pp. 282–285
[32] Matas, J., Chum, O., Urban, M., et al.: ‘Robust wide-baseline stereo from
maximally stable extremal regions’, Image Vis. Comput., 2004, 22, (10), pp.
761–767
[33] Bergasa, L.M., Yebes, J.J.: ‘Text location in complex images’. Int. Conf.
Pattern Recognition (ICPR), 2012, pp. 617–620
[34] Chen, H., Tsai, S.S., Schroth, G., et al.: ‘Robust text detection in natural
images with edge-enhanced maximally stable extremal regions’. 2011 18th
IEEE Int. Conf. Image Processing (ICIP), 2011, pp. 3–6
[35] Li, Y., Lu, H.: ‘Scene text detection via stroke width’. Proc. Int. Conf. Pattern
Recognition, 2012, pp. 681–684
[36] Jung, C., Liu, Q., Kim, J.: ‘A stroke filter and its application to text
localization’, Pattern Recognit. Lett., 2009, 30, (2), pp. 114–122
[37] Lucas, S.M., Panaretos, A., Sosa, L., et al.: ‘ICDAR 2003 robust reading
competitions’. Proc. Int. Conf. Document Analysis Recognition ICDAR,
2003, pp. 682–687
[38] Kim, Y., Jernite, Y., Sontag, D., et al.: ‘Character-aware neural language
models’. Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 2741–2749
[39] Pan, Y.F., Liu, C.L., Hou, X.: ‘Fast scene text localization by learning-based
filtering and verification’. Proc. Int. Conf. Image Processing ICIP, 2010, pp.
2269–2272
[40] Kai, W., Babenko, B., Belongie, S.: ‘End-to-end scene text recognition’. 2011
IEEE Int. Conf. Computer Vision (ICCV), 2011, no. 4, pp. 1457–1464
[41] Wang, T., Wu, D.J., Coates, A, et al.: ‘End-to-end text recognition with
convolutional neural networks’. 21st Int. Conf. Pattern Recognition, 2012, pp.
3304–3308
[42] Liu, G.-H., Yang, J.-Y., Li, Z.: ‘Content-based image retrieval using
computational visual attention model’, Pattern Recognit., 2015, 48, (8), pp.
2554–2566
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 

Recently uploaded (20)

HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 

Ts2 c topic

image.

Nowadays, images exist in the millions, and it is almost impossible to annotate each one manually. CBIR methods were introduced to overcome this limitation. They describe images by their visual content (i.e. colour, shape, and texture) and rely heavily on image descriptors and similarity measurement. A robust CBIR system aims to achieve high accuracy with minimum computation time. Several methods have been introduced to boost retrieval performance [4-6]; however, no ideal approach has been standardised and the problem remains challenging. With the growth of image data, simple features such as colour, shape, and texture are no longer sufficient to interpret an image efficiently.

Existing methods have mostly focused on extracting visual features such as colour, texture, and shape, and on fusing multiple visual descriptors [7-9]; indexing of similar images is then performed on these features. However, no standard method has yet been proposed for retrieving textual images. With the growing use of social media sites (e.g. Instagram, Flickr, and Facebook), millions of people share their pictures every day. Many of these pictures contain textual information, which is an additional and clearer clue for perceiving the image. It is common practice to edit pictures by writing inspirational and motivational quotes on them, as shown in Fig. 1b. Pictures captured in natural scenes also contain text against complex backgrounds, as shown in Fig. 1a. The text embedded in images is therefore useful information for automatic tagging, annotation, and indexing, and exploiting it makes it possible to retrieve textual images similar to a query image. Consequently, the detected text can be an enormous asset for improving the retrieval accuracy of TBIR and CBIR on textual images.

The automatic extraction of textual content is a challenging yet valuable task for several computer vision applications, for example helping a blind person read the contents of an image, helping a tourist translate them, or enabling a robot to act on them. In recent years, the problem of text detection and localisation in images has gained much attention [10-14], and several detection methods have been proposed. However, their core objective is only to detect and localise the text; they do not use the detected text for retrieving similar images. Some well-known text detection methods are summarised in Table 1.

Most state-of-the-art CBIR methods explore visual features, sometimes fusing several of them to achieve high accuracy, and perform image indexing on these features. In [7], Liu et al. proposed the colour information feature (CIF) and combined it with a local binary pattern (LBP)-based feature, since LBP-based features are not always good at capturing rich colour information. CIF can describe colour information, image brightness, and colour distribution.
Walia and Pal [8] proposed a fusion framework that combines low-level features using the colour difference histogram and the angular radial transform. Yang et al. [22] presented an approach based on a salient point detector and salient point expansion with local visual features, where the salient points are obtained with the speeded-up robust features detector. To cope with large visual vocabularies, Wang et al. [23] proposed a hierarchy of medium-sized vocabularies, with sparse representation used to select a specific vocabulary. In [24], Dimitrovski et al. employed predictive clustering trees to build an indexing structure for codebook construction; such codebooks can efficiently increase the discriminative power of the dictionary. These methods were developed for visual images (i.e. images containing colourful objects) and do not perform well on textual images.

Several methods have also been proposed to retrieve images based on social tags and keywords. In [25], Li et al. proposed a model that extracts visual, contextual, and semantic features to identify objects and predict scene tag importance; however, such tags are sometimes inaccurate and reflect emotion only. Wu et al. [26] introduced a method for incomplete and missing social tags, in which a tag matrix models the image-tag relation by combining observed tags and visual similarity. Liu et al. [27] proposed an approach for improving improper social tags that models the consistency of visual and semantic similarities with the social tags before and after improvement. Other authors introduced hybrid visual-textual relevance learning: Cui et al. [28] proposed a method that extracts text from image tags and associates it with visual features. In summary, these methods use visual features and social tags for indexing and retrieval, and no standard method has yet been proposed that detects the text in an image automatically and retrieves images based on that detected text.

In this paper, we propose a novel approach that retrieves similar textual images by detecting embedded and scene text. In particular, we apply a text detection technique and employ the detected text as keywords and tags for indexing and retrieving textual images. The key contributions of this work are as follows:

• To the best of our knowledge, there has been no standard method for textual image retrieval. This work is one of the few investigations on indexing and retrieving textual images effectively.
• The proposed method is designed for textual images, i.e. images containing text (e.g. quote images, scene images, and individual video frames).
• A fully automatic TBIR method is proposed that retrieves similar textual images based on the text detected in complex-background images; the detected text is employed as keywords/tags for indexing and retrieval.
• The method is robust and efficient in retrieving similar textual images for a given textual query image, and is based on an easy-to-use framework.
• A new dataset of 1000 images in 20 categories is introduced. It includes quote images, natural scene images, Twitter snapshots, TV channel video frames, and other textual images.

The rest of this paper is organised as follows. Section 2 describes the proposed method.
Section 3 presents the different similarity distance measures. Section 4 introduces a new dataset and evaluates the experimental results. Section 5 concludes the paper and outlines future directions.

Fig. 1: Sample textual images from the datasets. (a) ICDAR 2003 dataset, (b) Sindh dataset.

Table 1: State-of-the-art methods for text detection and localisation in natural scene images
  Method                   Precision  Recall  F value  Features                   Determination                            Datasets
  Ezaki et al. [15]        60         64      62       connected-component based  text detection in natural scene images   ICDAR 2003
  Zhou et al. [16]         37         88      53       texture based              text localisation and classification     ICDAR 2003
  Zhang and Kasturi [17]   67         46      -        edge based                 text edge detection and extraction       ICDAR 2003
  Epshtein et al. [18]     73         60      66       stroke based               stroke width transform (SWT)             ICDAR 2003, ICDAR 2005
  Ma et al. [19]           67         72      -        edge and CC based          component analysis and edge detection    ICDAR 2003
  Neumann and Matas [20]   59         55      57       texture and edge based     text localisation using MSER             ICDAR 2003, Chars75K
  Yi and Tian [21]         71         62      62       connected-component based  text detection in natural scene images   ICDAR 2003, OSTD
  ICDAR: International Conference on Document Analysis and Recognition; OSTD: Oriented Scene Text Dataset.

2 Proposed method

In this section, we introduce a novel approach that detects text and employs it to index and retrieve similar textual images. First, candidate text regions are detected using the maximally stable extremal region (MSER) algorithm. Since several non-text regions may remain after MSER, they are removed using simple geometric properties, and further filtering is carried out with the stroke width transform (SWT). After the positive text regions are obtained, bounding boxes are applied to form text lines from the textual components. Once the text is localised and detected, it is fed into optical character recognition (OCR), and a neural probabilistic language model (LM) is employed to form individual keywords from the recognised text. Four distance and similarity measures, namely the Euclidean distance, Canberra distance, Manhattan distance, and cosine similarity, are used to compute the similarity between the query image and the database images, and the top-ranked images are retrieved based on this computation. A schematic illustration of the proposed approach is shown in Fig. 2.
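To make the overall flow concrete, the following minimal sketch (not the authors' implementation) strings together off-the-shelf OpenCV MSER detection and Tesseract OCR via pytesseract. It deliberately skips the filtering and grouping stages of Sections 2.2-2.4, which are sketched separately below; the "--psm 7" page-segmentation setting and the lower-casing of keywords are assumptions.

```python
import cv2
import pytesseract

def extract_keywords(image_path):
    """End-to-end sketch: detect candidate text regions with MSER, OCR each
    candidate bounding box, and return the recognised words as keywords."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    mser = cv2.MSER_create()
    _, bboxes = mser.detectRegions(gray)      # candidate regions (Section 2.1)

    keywords = set()
    for (x, y, w, h) in bboxes:
        roi = gray[y:y + h, x:x + w]
        text = pytesseract.image_to_string(roi, config="--psm 7").strip()
        for word in text.split():
            if word.isalnum():                # keep plausible keywords only
                keywords.add(word.lower())
    return sorted(keywords)                   # keywords/tags used for indexing
```

In practice the raw MSER boxes would first pass through the non-text filtering and word grouping described next; the sketch only illustrates how the stages chain together.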
2.1 Character candidate extraction

MSER is regarded as one of the best region detectors owing to its robustness against scale, viewpoint, and illumination changes, and several methods have adopted it to extract character candidates with satisfactory results [29-31]. Its main advantage over traditional methods is that it detects most textual components even in low-quality images. Since text generally has distinct contrast against its complex background and a comparatively uniform colour intensity, MSER is a natural choice, and the proposed method employs it to extract character candidate regions [32]. Let p1, p2, p3, ..., pi be a sequence of nested extremal regions, i.e. pi ⊂ pi+1. Region pi is an MSER if

$v(i) = \frac{\lvert p_{i+\Delta} - p_i \rvert}{\lvert p_i \rvert}$  (1)

has a local minimum at i, where Δ is a parameter. The text regions obtained after applying the MSER filter are shown in Fig. 3b.

2.2 Non-text objects filtering

Many non-text objects may remain after MSER, so simple geometric properties such as width, height, and aspect ratio are used to filter out obvious non-text objects; objects with maximum and minimum variations are eliminated first. Numerous geometric properties can distinguish text from non-text objects, and the proposed method uses those described in [33, 34]:

Aspect ratio: the aspect ratio is given as

$\mathrm{Aspect\_ratio} = \frac{\max(\mathrm{width}, \mathrm{height})}{\min(\mathrm{width}, \mathrm{height})}$  (2)

We limit the aspect ratio of character candidates to the range 0.1-10. Some characters are very similar (e.g. '0' and 'O', 'i' and 'l'), so we merge them into one category.

Eccentricity: the ratio of the distance between the foci of the fitted ellipse to its major axis length. It is a scalar specifying the eccentricity of the ellipse with the same second moments as the region; an ellipse with eccentricity 0 is a circle and one with eccentricity 1 is a line segment. Regions with eccentricity > 0.995 are treated as line segments and removed.

Extent: a scalar giving the ratio of pixels in the region to pixels in the bounding box, i.e. the region area divided by the bounding-box area. Regions with extent outside the range 0.2-0.9 are removed.

Solidity: a scalar giving the proportion of the convex-hull pixels that are also in the region, i.e. the region area divided by the convex area. Regions with solidity < 0.3 are removed.

Euler number: a scalar giving the number of objects in the region minus the number of holes in those objects, computed with 8-connectivity. Regions with Euler number < -4 are removed.

Size: character candidates smaller than 5 px are discarded, as they carry very little information and only add processing time.

Most obvious non-text objects are removed after applying these geometric properties, as shown in Fig. 4a. Candidates that satisfy all the conditions are passed to the next step; a combined sketch of the MSER extraction and geometric filtering is given below.
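As an illustration (a minimal sketch, not the authors' implementation), the snippet below detects MSER candidates with OpenCV, rasterises them into a binary mask, and applies the geometric tests above through scikit-image region properties. The listed thresholds are interpreted here as rejection criteria and may need tuning.

```python
import cv2
import numpy as np
from skimage.measure import label, regionprops

def mser_candidate_mask(gray):
    """Section 2.1: rasterise MSER point sets into a binary candidate mask."""
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    mask = np.zeros(gray.shape, dtype=bool)
    for pts in regions:                      # pts is an array of (x, y) points
        mask[pts[:, 1], pts[:, 0]] = True
    return mask

def filter_non_text_regions(mask):
    """Section 2.2: keep connected components whose geometry is plausible for
    characters, using the thresholds listed above."""
    kept = []
    for region in regionprops(label(mask)):
        min_r, min_c, max_r, max_c = region.bbox
        h, w = max_r - min_r, max_c - min_c
        aspect = max(w, h) / max(min(w, h), 1)
        if not 0.1 <= aspect <= 10:          # aspect-ratio limits
            continue
        if region.eccentricity > 0.995:      # essentially a line segment
            continue
        if not 0.2 <= region.extent <= 0.9:  # region area vs. bounding-box area
            continue
        if region.solidity < 0.3:            # region area vs. convex-hull area
            continue
        if region.euler_number < -4:         # too many holes
            continue
        if region.area < 5:                  # smaller than 5 px
            continue
        kept.append(region)
    return kept
```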
Fig. 2: Schematic illustration of the proposed method.
Fig. 3: Textual region extraction. (a) Original image, (b) Detected MSER regions.
Fig. 4: Text detection and localisation. (a) Non-text object filtering based on geometric properties, (b) Non-text object filtering based on SWT.
2.3 Stroke width filtering

Geometric properties alone may not eliminate all non-text objects. Another common cue for distinguishing text from non-text is the stroke width, defined as the length of a straight line from a text pixel to another pixel along its gradient direction [35]. Several methods have adopted stroke width for false-positive elimination, since it measures the width of the curves and lines that form a character [18, 36]. Text regions tend to have little stroke width variation, whereas non-text regions vary more. The proposed method follows the SWT of [18] to further eliminate false positives.

SWT is a local image operator that measures, for each pixel, the width of the most likely stroke containing that pixel. The output image has the same size as the input image, and each element holds the stroke width associated with the corresponding pixel. Initially, every element of the SWT output is set to ∞. The gradient direction dp of each edge pixel p is computed; if p lies on a stroke boundary, dp is roughly perpendicular to the stroke orientation. The ray r = p + n·dp, n > 0, is followed until another edge pixel q is found, and the gradient direction dq at q is examined. If dq is roughly opposite to dp (dq = -dp ± π/6), each element of the SWT output along the segment [p, q] is assigned the width ∥p - q∥ unless it already holds a lower value; otherwise, or if no q is found, the ray is discarded.

Connected components are then filtered by the ratio of their stroke width standard deviation (std) to their stroke width mean: components with std/mean > 0.5 are rejected, a threshold obtained from the ICDAR benchmark [37]. Once the false positives are removed, the true components are passed to the next step for text line formation and word grouping. The positive text regions obtained are shown in Fig. 4b.
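The rejection rule above can be sketched without the full SWT machinery. In the hedged approximation below, per-component stroke widths are estimated as twice the distance-to-background sampled along the component skeleton; this is a cheap stand-in for the operator of [18], not the paper's implementation, and the 0.5 threshold is the value quoted above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

def stroke_widths(component_mask):
    """Approximate stroke widths for one connected component: twice the
    distance to the background, sampled on the component's skeleton."""
    dist = distance_transform_edt(component_mask)
    return 2.0 * dist[skeletonize(component_mask)]

def is_text_component(component_mask, max_ratio=0.5):
    """Section 2.3 rejection rule: keep a component only if the ratio of
    stroke-width standard deviation to mean does not exceed 0.5."""
    widths = stroke_widths(component_mask)
    if widths.size == 0:
        return False
    return widths.std() / widths.mean() <= max_ratio
```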
2.4 Text line formation

Adjacent character components are grouped together to form text lines; to detect these lines, the individual textual components need to be merged into meaningful words. Character candidates belonging to the same text line are expected to have similar properties (i.e. stroke width, height, size, and intensity). First, the midpoints of the connected components are compared by computing the Euclidean distance D and the orientation angle θ between each pair of connected components, yielding two maps: a distance map and an orientation map. If D < MaxDistance, the two connected components are assumed to be adjacent characters, where MaxDistance is the maximum Euclidean distance allowed between components. Assuming that text is generally found in a horizontal orientation, θ is restricted to between -45° and 45°. Each component pair satisfying these rules is then checked against the similarity criteria described in [35]; components satisfying all of the following are processed further:

$w_i + w_j > 1.3 \times D$
$\max(w_i/w_j, w_j/w_i) < 5$
$\max(h_i/h_j, h_j/h_i) < 2.5$
$\max(s_i/s_j, s_j/s_i) < 1.75$
$\max(n_i/n_j, n_j/n_i) < 1.75$

where wi, hi, si, and ni denote the width, height, mean stroke width, and intensity of the ith bounding box, respectively. The threshold values can be adjusted experimentally. If a line contains at least three textual objects, it is declared a text line; the process ends when no more components can be merged. Connected components satisfying the above conditions are grouped together, while the remaining components are considered false positives and eliminated. The formed text lines are shown in Fig. 5a.

The formed text lines are then split into individual words for recognition. We compute the overlap ratio between all bounding-box pairs by measuring the distance between the textual components, and look for non-zero overlap ratios to locate groups of neighbouring text regions. A threshold T is given as

$T = \mathrm{mean}(D) + \alpha \times \mathrm{std}(D)$  (3)

where D is the vector of horizontal distances between components. If the distance between two components exceeds the threshold, they are considered to belong to different words and are separated. We set α = 1.5 and apply a bounding box to each word individually; the resulting bounding boxes are shown in Fig. 5b.

2.5 Text recognition and keywords formation

The true text regions detected in Section 2.4 are fed into an OCR engine. Several commercial and open-source OCR tools are freely available; we adopt Google's open-source Tesseract OCR engine [https://opensource.google.com/projects/tesseract] for text recognition. The proposed method employs the recognised words as tags and keywords for indexing the images. The natural strategy is to retrieve similar images based on the text confidence score and the maximum string match, so words with a high confidence score are retrieved first. If no text is detected in an image, retrieving that image becomes difficult; hence, we aim for a high recall ratio by allowing more false positives in order to obtain as many keywords as possible.

For keyword formation, the proposed method employs a neural probabilistic LM that relies on character-level inputs while still making predictions at the word level [38]. The model is based on a convolutional neural network whose single-layer output at time t is fed into a recurrent neural network LM. Let γ be the vocabulary of recognised keywords, C the vocabulary of characters, d the dimensionality of each character embedding, and Q ∈ ℝ^{d×|C|} the character embedding matrix. Suppose word k ∈ γ consists of the characters (c1, c2, ..., cl), where l is the length of word k. The character-level representation of k is then the matrix C^k ∈ ℝ^{d×l}, whose jth column corresponds to character cj. A narrow convolution between C^k and a filter (kernel) H ∈ ℝ^{d×w} of width w is applied, a bias is added, and a non-linearity is applied to obtain the feature map f^k ∈ ℝ^{l-w+1}. The ith component of f^k is

$f^k[i] = \tanh\left(\langle C^k[\ast, i:i+w-1], H \rangle + b\right)$  (4)

where C^k[∗, i:i+w-1] denotes the ith to (i+w-1)th columns of C^k and ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product. The feature corresponding to filter H for word k is taken as the max over time

$y^k = \max_i f^k[i]$  (5)

For a given filter, the basic idea is to pick the substring with the maximum score. The network uses several filters of varying width w to obtain a feature vector for each word k: for a total of h filters H1, H2, ..., Hh, the representation of k is y^k = (y^k_1, y^k_2, ..., y^k_h).

Fig. 5: Text formation. (a) Text line formation, (b) Keyword formation.
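The OCR and keyword-collection step can be sketched with pytesseract, which exposes Tesseract's per-word confidence scores used later for ranking. This is a hedged sketch rather than the authors' code: the min_confidence cut-off is an assumed tuning parameter, and word_boxes is assumed to be the list of word bounding boxes produced in Section 2.4.

```python
import pytesseract
from pytesseract import Output

def recognise_keywords(gray, word_boxes, min_confidence=40):
    """Section 2.5: OCR each detected word box and keep the recognised string
    together with Tesseract's confidence, so images can later be ranked by
    string match and confidence score."""
    keywords = []
    for (x, y, w, h) in word_boxes:
        roi = gray[y:y + h, x:x + w]
        data = pytesseract.image_to_data(roi, config="--psm 7",
                                         output_type=Output.DICT)
        for text, conf in zip(data["text"], data["conf"]):
            text = text.strip().lower()
            if text and float(conf) >= min_confidence:
                keywords.append((text, float(conf)))
    return keywords
```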
3 Similarity measure

For an accurate image retrieval system, both feature extraction and similarity measurement play an important role: even when feature extraction goes smoothly, a poorly chosen similarity measure yields noisy results. The proposed method supports two modes of operation: exact substring match and approximate substring match. Exact substring match retrieves images whose recognised keywords match the query words exactly, whereas approximate substring match retrieves images with the closest matching strings. In general, exact substring matches have priority over approximate matches and are retrieved first.

To compute the similarity between the detected texts, we use the Euclidean distance, Canberra distance, Manhattan distance, and cosine similarity. The feature vector of the ith database image is F_{DBi} = {w1, w2, ..., wN}, where N is the number of recognised keywords in that image, and the feature vector of the query image q is F_q = {w1, w2, ..., wN}, where N is the number of recognised keywords in q. The aim is to select from the database the images having the maximum number of strings matching the query image. The distance measures are as follows.

Euclidean distance:

$D(F_{DBi}, F_q) = \left( \sum_{i=1}^{N} (F_{DBi} - F_q)^2 \right)^{1/2}$  (6)

Canberra distance:

$D(F_{DBi}, F_q) = \sum_{i=1}^{N} \frac{\lvert F_{DBi} - F_q \rvert}{\lvert F_{DBi} \rvert + \lvert F_q \rvert}$  (7)

Manhattan distance:

$D(F_{DBi}, F_q) = \sum_{i=1}^{N} \lvert F_{DBi} - F_q \rvert$  (8)

Cosine similarity:

$D(F_{DBi}, F_q) = \frac{F_{DBi} \cdot F_q}{\lVert F_{DBi} \rVert \, \lVert F_q \rVert}$  (9)

where F_{DBi} is the feature vector of a database image and F_q is the feature vector of the query image.
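The four measures of eqs. (6)-(9) can be computed directly once each image's keywords are mapped onto a fixed vocabulary. In the sketch below, the vocabulary (a word-to-index dictionary) and the bag-of-keywords counting are assumptions about how the feature vectors F are built, not details given in the paper.

```python
import numpy as np

def keyword_vector(keywords, vocabulary):
    """Bag-of-keywords vector F over a fixed vocabulary (word -> index)."""
    vec = np.zeros(len(vocabulary))
    for word in keywords:
        if word in vocabulary:
            vec[vocabulary[word]] += 1
    return vec

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))                        # eq. (6)

def canberra(a, b):
    denom = np.abs(a) + np.abs(b)
    mask = denom > 0                                            # skip empty terms
    return np.sum(np.abs(a - b)[mask] / denom[mask])            # eq. (7)

def manhattan(a, b):
    return np.sum(np.abs(a - b))                                # eq. (8)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)  # eq. (9)
```

Note that eqs. (6)-(8) are distances (smaller means more similar), whereas eq. (9) is a similarity (larger means more similar), so rankings must be ordered accordingly.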
4 Experimental results and discussion

In this section, we present the experimental results and performance evaluation. All experiments are implemented and executed on a computer with 8 GB of random access memory and a 3.10 GHz Intel Core i5-2100 central processing unit.

4.1 Datasets

The experiments are conducted on two benchmark datasets, both containing textual images, to assess the accuracy and robustness of the proposed approach:

ICDAR 2003: this dataset [37] contains 500 natural scene images with resolutions varying from 640 × 480 to 1600 × 1200, of which 251 images belong to the TrialTrain set and 249 to the TrialTest set. The images were captured indoors and outdoors under varying conditions (i.e. text size, font, colour, illumination, and position), with text appearing on signboards, banners, posters, and other objects.

Sindh: we propose a new dataset, named Sindh, containing 1000 images in total, including quotation images, Twitter snapshots, natural scene images, TV news channel video frames, and other textual images. The resolution of the images varies from 320 × 240 to 1920 × 1440; they were collected randomly from Google, Instagram, and Twitter and divided into 20 groups.

4.2 Retrieval performance protocol

The retrieval accuracy is measured by the mean average precision (mAP), i.e. the average over all image queries. The average precision (AP) of the top-ranked images is given as

$P(R_k) = \frac{\lvert \text{relevant images} \cap \text{retrieved images} \rvert}{\lvert \text{retrieved images} \rvert}$  (10)

where Rk denotes the top retrieved images; we set k = 10 since users are mostly concerned with the top-ranked results. The AP value for a single query is the average of the precision values obtained for the set of k images, and the AP values are then averaged over all queries. Given the set of relevant images for a query qi ∈ Q as {I1, ..., Im}, where Q is the set of all queries, the mAP is

$\mathrm{mAP}(Q) = \frac{1}{\lvert Q \rvert} \sum_{i=1}^{\lvert Q \rvert} \frac{1}{m} \sum_{k=1}^{m} P(R_k)$  (11)

4.3 Implementation detail

The proposed method detects the text embedded within images and uses it as keywords/tags for retrieving textual images. To evaluate its performance and efficiency, we first run experiments on text detection and recognition and compare with state-of-the-art methods; we then evaluate the proposed method for detected text-based image retrieval.

4.3.1 Text detection and recognition: here we conduct two experiments: (i) text detection and (ii) end-to-end text recognition.

Experiment I: accurate text detection is the most important task for a robust system. We therefore evaluate text detection on the benchmark datasets defined in Section 4.1 and compare the results with state-of-the-art methods, following the standard evaluation protocols for precision and recall stated in [37]. The precision p′, recall r′, and frequency measure f are given as

$p' = \frac{\sum_{r_e \in E} m(r_e, T)}{\lvert E \rvert}$  (12)

$r' = \frac{\sum_{r_t \in T} m(r_t, E)}{\lvert T \rvert}$  (13)

$f = \frac{1}{(\alpha / p') + ((1 - \alpha)/r')}$  (14)

where E is the set of estimated words and T is the set of ground-truth targets. The frequency measure f combines precision and recall, with their relative weights controlled by α. All performance measures are computed per image and then averaged to give the overall performance of the proposed approach. On the ICDAR 2003 dataset, the proposed approach achieves 74% precision and 68% recall; on the Sindh dataset, it achieves 75% precision and 70% recall. The results show that the proposed approach outperforms state-of-the-art methods in precision and f value on the ICDAR 2003 dataset, while the accuracy on the Sindh dataset is lower due to the high complexity of its different image categories. The obtained results are given in Tables 2 and 3 for the ICDAR 2003 and Sindh datasets, respectively; a sketch of this matching-based evaluation is given below.
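The sketch below illustrates eqs. (12)-(14) for one image, assuming, as in the ICDAR 2003 protocol [37], that the match score m(r, T) is the best area-overlap ratio between a rectangle and the target set; the exact definition of m in [37] differs slightly, so this is an approximation for illustration only.

```python
def box_match(box, targets):
    """Best-match score m(r, T): largest area-overlap ratio between a
    rectangle (x0, y0, x1, y1) and any rectangle in the target set."""
    def overlap(a, b):
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        iw = max(0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
        return inter / union if union else 0.0
    return max((overlap(box, t) for t in targets), default=0.0)

def detection_scores(estimates, targets, alpha=0.5):
    """Precision, recall and frequency measure of eqs. (12)-(14) for one image."""
    p = sum(box_match(e, targets) for e in estimates) / max(len(estimates), 1)
    r = sum(box_match(t, estimates) for t in targets) / max(len(targets), 1)
    f = 1.0 / (alpha / p + (1 - alpha) / r) if p and r else 0.0
    return p, r, f
```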
Experiment II: we evaluate end-to-end word recognition on the ICDAR 2003 and Sindh datasets. Two metrics are common for recognition performance: normalised edit distance and word-level recognition. The former is an older metric that tolerates partial local errors within a word; we use the latter, stricter metric, which requires every character to be recognised correctly. For word recognition we again follow the evaluation protocols defined in [37]. Precision p is the ratio of the number of correctly recognised words to the total number of words recognised by the system, and recall r is the ratio of the number of correctly recognised words to the total number of words localised and detected. A bounding box is counted as a match if it overlaps a ground-truth bounding box by more than 50%. Tables 4 and 5 show the performance of different recognition methods on the ICDAR 2003 and Sindh datasets; the performance of the proposed method is reported as the word-level recognition rate, which is commonly used for fair comparison.

4.3.2 Image retrieval performance: to assess the retrieval accuracy of the proposed approach, similarity measures are computed in experiments on the ICDAR 2003 and Sindh datasets. From ICDAR 2003, we randomly select 100 images and use them as queries. The Sindh dataset is divided into 20 categories; we randomly select 10 images from each category as queries, giving 200 query images in total. We compute the precision and recall ratios for each image in the database: precision p is the ratio of the number of retrieved relevant images to the total number of retrieved images, and recall r is the ratio of the number of retrieved relevant images to the total number of relevant images in the dataset, so that p reflects the accuracy and r the robustness of the retrieval system. The top retrieved image is taken to be the one with the maximum number of matching words.

In this experiment, when the user provides a textual query image, the system automatically detects the text and uses it to index the images; if an image contains no text, the system assigns it an auxiliary value of '1'. Two operations are performed: exact substring match and approximate substring match. Exact substring match retrieves images containing exactly the same words as the query image, while approximate substring match retrieves images whose confidence scores are closest to the query image. Exact substring matches have priority over approximate matches until k images have been retrieved (a sketch of these two matching modes is given at the end of this section). We compare image retrieval on the datasets defined in Section 4.1 against Liu's method [42], a purely visual image retrieval method. Table 6 shows the retrieval accuracy obtained; the results demonstrate that Liu's method does not perform well on textual images, whereas the proposed method performs well specifically on them.

4.4 Retrieval time complexity

For image retrieval, low computation and retrieval times are crucial. Computation time and feature selection trade off against each other: extracting additional features leads to more time consumption. The proposed method finds a good balance between text detection and image retrieval. The time complexity of the proposed approach on both benchmarks is given in Table 7. The results show that the proposed method outperforms Liu's method on the ICDAR 2003 dataset. On the Sindh dataset, its computation time is slightly higher than that of Liu's method because of the highly complex backgrounds and the large number of small fonts; the Sindh dataset contains several images with small fonts, which are difficult to process within the average time.
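The two retrieval modes described in Section 4.3.2 can be sketched as a simple ranking routine. The sketch below counts exact keyword matches first and breaks ties with an approximate string similarity; SequenceMatcher is used here as an assumed stand-in for the paper's confidence-based approximate matching, and the database layout (image id mapped to its recognised keywords) is likewise an assumption.

```python
from difflib import SequenceMatcher

def rank_images(query_keywords, database, k=10):
    """Rank database entries (image_id -> list of recognised keywords) against
    a query: exact matches are counted first (exact substring mode), and ties
    are broken by an approximate string similarity (approximate mode)."""
    query = [w.lower() for w in query_keywords]
    scored = []
    for image_id, keywords in database.items():
        keywords = [w.lower() for w in keywords]
        exact = sum(1 for w in query if w in keywords)
        approx = max((SequenceMatcher(None, q, w).ratio()
                      for q in query for w in keywords), default=0.0)
        scored.append((exact, approx, image_id))
    scored.sort(reverse=True)
    return [image_id for _, _, image_id in scored[:k]]
```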
4.5 AP at different distances

For an accurate retrieval system, the choice of distance measure is also a crucial factor, since different distance measures can affect the retrieval results differently. We computed the four similarity measures defined in Section 3 to assess the retrieval accuracy of the proposed method. Fig. 6 shows the accuracy at different distance measures for the top k images; the results show that the Euclidean distance performs best among the compared measures.

Table 2: Performance comparison of text detection on the ICDAR 2003 dataset
  Method                   Precision  Recall  f
  Neumann and Matas [20]   0.59       0.55    0.57
  Li and Lu [35]           0.59       0.59    0.59
  Pan et al. [39]          0.66       0.70    0.68
  Chen et al. [34]         0.73       0.60    0.66
  Proposed method          0.74       0.68    0.70

Table 3: Performance evaluation of text detection on the Sindh dataset
  Method                   Precision  Recall  f
  Proposed method          0.75       0.70    0.72

Table 4: Performance comparison for end-to-end text recognition on ICDAR 2003
  Method                   IC03-50  IC03-full
  Kai et al. [40]          0.55     0.56
  Wang et al. [41]         0.72     0.67
  Proposed method          0.73     0.69

Table 5: Performance evaluation for end-to-end text recognition on Sindh
  Method                   Sindh-50  Sindh-full
  Proposed method          0.67      0.73

Table 6: Retrieval performance (mAP) on ICDAR 2003 and Sindh
  Dataset      Liu's method [42]  Proposed method
  ICDAR 2003   0.54               0.63
  Sindh        0.59               0.71

Table 7: Retrieval time (s) on ICDAR 2003 and Sindh
  Dataset      Liu's method [42]  Proposed method
  ICDAR 2003   3.46               3.17
  Sindh        4.38               4.51

Fig. 6: Performance of top-ranked retrieved images on (a) ICDAR 2003, (b) the Sindh dataset.

5 Conclusion

In this paper, we have investigated an effective image retrieval method for textual images based on embedded and scene text. The proposed method first detects candidate text regions using the MSER algorithm; non-text regions are eliminated using geometric properties and the SWT, and the remaining connected components are grouped using bounding boxes. The detected and localised text regions are fed into an OCR engine for recognition, and keywords are formed with a neural probabilistic LM for retrieval. Finally, the textual images are indexed and retrieved based on the detected keywords using four different distance measures. To validate the proposed method on embedded and scene text images, we have offered a new dataset containing quote images, Twitter snapshots, natural scene images, TV news channel video frames, and other textual images. The experimental results on two benchmark datasets show the effectiveness of the proposed approach for textual images; the method is robust against varying text properties such as font, size, colour, illumination, and orientation in complex backgrounds. In future work, we intend to improve the proposed approach by fusing textual features with visual features for a more accurate and efficient retrieval method.
6 Acknowledgments

This research was supported by the National Natural Science Foundation of China (Nos. 61672124 and 61370145) and the Password Theory Project of the 13th Five-Year Plan National Cryptography Development Fund (No. MMJJ20170203).

7 References

[1] Wang, X., Wang, Z.: 'The method for image retrieval based on multi-factors correlation utilizing block truncation coding', Pattern Recognit., 2014, 47, (10), pp. 3293–3303
[2] Fadaei, S., Amirfattahi, R., Ahmadzadeh, M.R.: 'New content-based image retrieval system based on optimised integration of DCD, wavelet and curvelet features', IET Image Process., 2017, 11, (2), pp. 89–98
[3] Alzu'bi, A., Amira, A., Ramzan, N.: 'Semantic content-based image retrieval: a comprehensive study', J. Vis. Commun. Image Represent., 2015, 32, pp. 20–54
[4] Jiang, F., Hu, H.-M., Zheng, J., et al.: 'A hierarchal BoW for image retrieval by enhancing feature salience', Neurocomputing, 2016, 175, pp. 146–154
[5] ElAdel, A., Ejbali, R., Zaied, M., et al.: 'A hybrid approach for content-based image retrieval based on fast beta wavelet network and fuzzy decision support system', Mach. Vis. Appl., 2016, 27, (6), pp. 1–19
[6] Feng, L., Wu, J., Liu, S., et al.: 'Global correlation descriptor: a novel image representation for image retrieval', J. Vis. Commun. Image Represent., 2015, 33, pp. 104–114
[7] Liu, P., Guo, J.M., Chamnongthai, K., et al.: 'Fusion of color histogram and LBP-based features for texture image retrieval and classification', Inf. Sci. (NY), 2017, 390, pp. 95–111
[8] Walia, E., Pal, A.: 'Fusion framework for effective color image retrieval', J. Vis. Commun. Image Represent., 2014, 25, (6), pp. 1335–1348
[9] Wang, X., Wang, Z.: 'A novel method for image retrieval based on structure elements' descriptor', J. Vis. Commun. Image Represent., 2013, 24, (1), pp. 63–74
[10] Tang, Y., Wu, X.: 'Scene text detection and segmentation based on cascaded convolution neural networks', IEEE Trans. Image Process., 2017, 26, (3), pp. 1509–1520
[11] Wei, Y., Zhang, Z., Shen, W., et al.: 'Text detection in scene images based on exhaustive segmentation', Signal Process. Image Commun., 2017, 50, pp. 1–8
[12] Unar, S., Jalbani, A.H., Shaikh, M., et al.: 'A study on text detection and localization techniques for natural scene images', Int. J. Comput. Sci. Netw. Secur., 2018, 18, (1), pp. 99–111
[13] Zheng, Y., Li, Q., Liu, J., et al.: 'A cascaded method for text detection in natural scene images', Neurocomputing, 2017, 238, pp. 307–315
[14] Unar, S., Jalbani, A.H., Jawaid, M.M., et al.: 'Artificial Urdu text detection and localization from individual video frames', Mehran Univ. Res. J. Eng. Technol., 2018, 37, (2), pp. 429–438
[15] Ezaki, N., Bulacu, M., Schomaker, L.: 'Text detection from natural scene images: towards a system for visually impaired persons'. 17th Int. Conf. Pattern Recognition (ICPR), 2004, vol. 2, pp. 683–686
[16] Zhou, G., Liu, Y., Meng, Q., et al.: 'Detecting multilingual text in natural scene'. Proc. First Int. Symp. Access Spaces (ISAS 2011), 2011, pp. 116–120
[17] Zhang, J., Kasturi, R.: 'Text detection using edge gradient and graph spectrum'. Proc. Int. Conf. Pattern Recognition, 2010, pp. 3979–3982
[18] Epshtein, B., Ofek, E., Wexler, Y.: 'Detecting text in natural scenes with stroke width transform'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010, pp. 2963–2970
[19] Ma, L., Wang, C., Xiao, B.: 'Text detection in natural images based on multi-scale edge detection and classification'. Third Int. Congress on Image and Signal Processing, 2010, vol. 4, pp. 1961–1965
[20] Neumann, L., Matas, J.: 'A method for text localization and recognition in real-world images'. Lecture Notes in Computer Science (LNCS 6494), 2011, part 3, pp. 770–783
[21] Yi, C., Tian, Y.: 'Text string detection from natural scenes by structure-based partition and grouping', IEEE Trans. Image Process., 2011, 20, (9), pp. 2594–2605
[22] Yang, H.-Y., Li, Y.-W., Li, W.-Y., et al.: 'Content-based image retrieval using local visual attention feature', J. Vis. Commun. Image Represent., 2014, 25, (6), pp. 1308–1323
[23] Wang, Y., Cen, Y., Zhao, R., et al.: 'Separable vocabulary and feature fusion for image retrieval based on sparse representation', Neurocomputing, 2017, 236, pp. 14–22
[24] Dimitrovski, I., Kocev, D., Loskovska, S., et al.: 'Improving bag-of-visual-words image retrieval with predictive clustering trees', Inf. Sci. (NY), 2016, 329, pp. 851–865
[25] Li, S., Purushotham, S., Chen, C., et al.: 'Measuring and predicting tag importance for image retrieval', IEEE Trans. Pattern Anal. Mach. Intell., 2017, pp. 1–14
[26] Wu, L., Jin, R., Jain, A.K.: 'Tag completion for image retrieval', IEEE Trans. Pattern Anal. Mach. Intell., 2013, 35, (3), pp. 716–727
[27] Liu, D., Wang, M., Yang, L., et al.: 'Tag quality improvement for social images'. Proc. IEEE Int. Conf. Multimedia and Expo (ICME 2009), 2009, pp. 350–353
[28] Cui, C., Lin, P., Nie, X., et al.: 'Hybrid textual–visual relevance learning for content-based image retrieval', J. Vis. Commun. Image Represent., 2017, 48, pp. 367–374
[29] Neumann, L., Matas, J.: 'Real-time scene text localization and recognition'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 3538–3545
[30] Shi, C., Wang, C., Xiao, B., et al.: 'Scene text detection using graph model built upon maximally stable extremal regions', Pattern Recognit. Lett., 2013, 34, (2), pp. 107–116
[31] Felhi, M., Bonnier, N., Tabbone, S.: 'A skeleton based descriptor for detecting text in real scene images'. 21st Int. Conf. Pattern Recognition (ICPR), 2012, pp. 282–285
[32] Matas, J., Chum, O., Urban, M., et al.: 'Robust wide-baseline stereo from maximally stable extremal regions', Image Vis. Comput., 2004, 22, (10), pp. 761–767
[33] Bergasa, L.M., Yebes, J.J.: 'Text location in complex images'. Int. Conf. Pattern Recognition (ICPR), 2012, pp. 617–620
[34] Chen, H., Tsai, S.S., Schroth, G., et al.: 'Robust text detection in natural images with edge-enhanced maximally stable extremal regions'. 18th IEEE Int. Conf. Image Processing (ICIP), 2011, pp. 3–6
[35] Li, Y., Lu, H.: 'Scene text detection via stroke width'. Proc. Int. Conf. Pattern Recognition, 2012, pp. 681–684
[36] Jung, C., Liu, Q., Kim, J.: 'A stroke filter and its application to text localization', Pattern Recognit. Lett., 2009, 30, (2), pp. 114–122
[37] Lucas, S.M., Panaretos, A., Sosa, L., et al.: 'ICDAR 2003 robust reading competitions'. Proc. Int. Conf. Document Analysis and Recognition (ICDAR), 2003, pp. 682–687
[38] Kim, Y., Jernite, Y., Sontag, D., et al.: 'Character-aware neural language models'. Proc. 30th AAAI Conf. Artificial Intelligence, 2016, pp. 2741–2749
[39] Pan, Y.F., Liu, C.L., Hou, X.: 'Fast scene text localization by learning-based filtering and verification'. Proc. Int. Conf. Image Processing (ICIP), 2010, pp. 2269–2272
[40] Kai, W., Babenko, B., Belongie, S.: 'End-to-end scene text recognition'. IEEE Int. Conf. Computer Vision (ICCV), 2011, pp. 1457–1464
[41] Wang, T., Wu, D.J., Coates, A., et al.: 'End-to-end text recognition with convolutional neural networks'. 21st Int. Conf. Pattern Recognition, 2012, pp. 3304–3308
[42] Liu, G.-H., Yang, J.-Y., Li, Z.: 'Content-based image retrieval using computational visual attention model', Pattern Recognit., 2015, 48, (8), pp. 2554–2566