Querying the Cairo Genizah Images with Word-Spotting Algorithm (En)
Adiel Ben-Shalom, Prof. Yaacov Choueka, The Friedberg Genizah Project, Prof. Nachum Dershowitz, Prof. Lior Wolf, Tel Aviv University
1. Querying a Large Corpus of
Historical Handwritten Manuscripts
Using Word-Spotting Algorithms
Yaacov Choueka, Adiel Ben-Shalom
The Friedberg Genizah Project
Nachum Dershowitz, Lior Wolf, Adi Silberfenig
School of Computer Science, Tel Aviv University
Minerva 2015,
Jerusalem
2. The Problem: find all occurrences of a
given query-word in all the manuscripts
of the corpus
(arbitrary language, arbitrary script)
Example:
The Cairo Genizah Corpus
360,000 fragments
Hebrew characters, Hebrew and Arabic languages
The query: בראשית
5. The catch:
The software can search only
manuscripts that have been
transcribed into electronic form!
Usually, however, most of the manuscripts
are never transcribed!
In the Genizah case:
480,000 images are available
only 40,000 (8%) have been transcribed!
7. The word-spotting approach:
Search for the image
of the query word
(and not for its text)
8. Given one (or more)
image(s)
of a query word,
find all occurrences of
similar images in the
corpus collection of
manuscripts’ images
Query:
Word-spotting
13. 4. Image descriptors for every patch
Constructing, for every patch
an image-descriptor vector of
12,460 real numbers
140 cells * (31+58)=12,460
(31 features of HOG vector)
(58 features of LBP vector)
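The dimension count above can be checked with a small sketch. It assumes that each patch is divided into 140 cells and that each cell contributes a 31-value HOG vector and a 58-value LBP vector, concatenated cell by cell. The function name and the zero-filled placeholder features are illustrative only, not the authors' code.

```python
# Sketch: assembling a patch descriptor from per-cell HOG and LBP features.
# The three constants come from the slide; everything else is hypothetical.
CELLS_PER_PATCH = 140
HOG_DIMS = 31
LBP_DIMS = 58


def build_descriptor(hog_cells, lbp_cells):
    """Concatenate per-cell HOG and LBP vectors into one flat descriptor."""
    if len(hog_cells) != len(lbp_cells):
        raise ValueError("HOG and LBP must cover the same cells")
    descriptor = []
    for hog, lbp in zip(hog_cells, lbp_cells):
        descriptor.extend(hog)
        descriptor.extend(lbp)
    return descriptor


# Placeholder features (all zeros) stand in for real HOG/LBP output.
hog_cells = [[0.0] * HOG_DIMS for _ in range(CELLS_PER_PATCH)]
lbp_cells = [[0.0] * LBP_DIMS for _ in range(CELLS_PER_PATCH)]
descriptor = build_descriptor(hog_cells, lbp_cells)
print(len(descriptor))  # 140 * (31 + 58) = 12460
```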
14. 5. Dimension Reduction
PCA – Principal Component Analysis
[Diagram: an M × 12,460 matrix with one row per patch (Patch 1, Patch 2, Patch 3, ..., Patch M) is reduced to an M × 1000 matrix]
M = Total number of patches in all images of the corpus
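A minimal sketch of the reduction step, assuming plain PCA computed with NumPy's SVD. The slide names PCA but gives no implementation details, so the function names are hypothetical and the matrix sizes are shrunk (200 × 60 reduced to 200 × 10) in place of the real M × 12,460 reduced to M × 1000.

```python
import numpy as np


def fit_pca(descriptors, n_components):
    """Return the mean and the top principal axes of the descriptor matrix."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(descriptors, mean, components):
    """Project descriptors onto the retained principal axes."""
    return (descriptors - mean) @ components.T


rng = np.random.default_rng(0)
patches = rng.normal(size=(200, 60))                   # stand-in for M x 12,460
mean, components = fit_pca(patches, n_components=10)   # stand-in for 1000
reduced = project(patches, mean, components)
print(reduced.shape)  # (200, 10)
```

The same `mean` and `components` would have to be applied to a query patch so that it lands in the same reduced space as the corpus patches.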
15. 6. Similarity Computation
Computing an efficient
similarity measure
between
the query-reduced vector
and
the reduced vectors
of all patches of all
images in the corpus
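[Diagram: the 1000-dimensional vector of a query patch is compared with the M × 1000 dataset matrix (Patch 1, Patch 2, Patch 3, ..., Patch M); the result is a vector of length M whose entry i is the similarity of the query patch to patch number i]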
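The comparison can be sketched as one matrix-vector product. The slide does not name the similarity measure, so cosine similarity is used here purely as an assumption; the function name and the toy 2-dimensional vectors are illustrative.

```python
import numpy as np


def cosine_similarities(query, patches):
    """Cosine similarity between one query vector and every row of patches."""
    query_norm = np.linalg.norm(query)
    patch_norms = np.linalg.norm(patches, axis=1)
    # Guard against zero-length vectors to avoid division by zero.
    denom = np.maximum(patch_norms * query_norm, 1e-12)
    return (patches @ query) / denom


# Toy data: four "patches" and one "query" in a 2-dimensional space.
patches = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
scores = cosine_similarities(query, patches)
print(scores)  # entry i is the similarity of the query to patch i
```

If the reduced vectors are normalized to unit length ahead of time, the whole step reduces to `patches @ query`, which is one reason a measure of this kind is cheap to evaluate at query time.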
16. 7. Result
Sort the results by decreasing similarity
and display the patches with the best
similarity to the query
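A minimal sketch of the ranking step, assuming the similarity scores are already available as a list with one entry per patch. The function name and the sample scores are made up for illustration.

```python
def top_matches(scores, k):
    """Return (patch_index, score) pairs for the k highest scores."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up similarity scores for five patches.
scores = [0.12, 0.93, 0.40, 0.88, 0.05]
best = top_matches(scores, k=3)
print(best)  # [(1, 0.93), (3, 0.88), (2, 0.4)]
```

For a very large number of patches, a partial selection such as `heapq.nlargest` avoids sorting the full score list when only the best few results are displayed.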
17. Two Tests
1. George Washington – Handwritten
2. Lord Byron – Printed
20 pages, about 5000 words each

                          George Washington   Lord Byron
                          (handwritten)       (printed)
Precision                 50%                 91%
Single query              0.08 sec            0.03 sec
Pre-processing per page   46 sec              3 sec
18. Current Problems
1. Efficiently building (off-line, in terms
of space and time) compact
image-descriptors for all patches from all
(half-a-million) images.
2. Building an efficient (on-line) system
for comparing the query vector to all
(100 million?) patches’ vectors
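A back-of-the-envelope sketch of why space is a concern. It takes the tentative figure of 100 million patches from the slide and assumes each vector component is stored as a 4-byte float; the storage format is an assumption, not something the slides state.

```python
def storage_bytes(n_patches, dims, bytes_per_value=4):
    """Raw size of a dense n_patches x dims matrix of fixed-width values."""
    return n_patches * dims * bytes_per_value


N_PATCHES = 100_000_000  # the slide's tentative "100 million?"

reduced_size = storage_bytes(N_PATCHES, 1000)     # after PCA
full_size = storage_bytes(N_PATCHES, 12_460)      # before PCA

print(reduced_size / 1e9, "GB")   # 400.0 GB
print(full_size / 1e12, "TB")     # 4.984 TB
```

Under these assumptions even the reduced vectors do not fit in the memory of a single ordinary machine, which is consistent with the slide listing both the off-line build and the on-line comparison as open problems.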
19. Once these problems are solved
and the system is implemented,
it will open new horizons
for the study of large corpora
of historical documents