D01 choueka dershowitz_word_spotting_algorithm

Querying a Large Corpus of
Historical Handwritten Manuscripts
Using Word-Spotting Algorithms
Yaacov Choueka, Adiel ben-Shalom
The Friedberg Genizah Project
Nachum Dershowitz, Lior Wolf, Adi Silberfenig
School of Computer Science, Tel Aviv University
Minerva 2015,
Jerusalem

The Problem: find all occurrences of a
given query-word in all the manuscripts
of the corpus
(arbitrary language, arbitrary script)
Example:
The Cairo Genizah Corpus
360,000 fragments
Hebrew characters, Hebrew and Arabic languages
The query: ‫בראשית‬

Simple Solution: full-text search

The catch:
The software can search only
manuscripts that have been
transcribed into electronic form!
Usually, however, most of the manuscripts
are never transcribed!
In the Genizah case:
480,000 images are available
only 40,000 (8%) have been transcribed!

OCR
Does not work well
for handwritten historical documents
‫את‬ ‫יהוה‬ ‫ישמע‬ ‫כי‬ ‫אהבתי‬
‫לי‬ ‫אוזנו‬ ‫הטה‬ ‫כי‬ ‫תחנוני‬ ‫קולי‬
‫מות‬ ‫חבלי‬ ‫אפפוני‬ ‫אקרא‬ ‫ובימי‬
‫צרה‬ ‫מצאוני‬ ‫שאול‬ ‫ומצרי‬
‫יהוה‬ ‫ובשם‬ ‫אמצא‬ ‫ויגון‬
‫מלטה‬ ‫יהוה‬ ‫אנה‬ ‫אקרא‬
‫ואלוהינו‬ ‫וצדיק‬ ‫יהוה‬ ‫חנון‬ ‫נפשי‬
‫יהוה‬ ‫פתאים‬ ‫שומר‬ ‫מרחם‬
‫נפשי‬ ‫שובי‬ ‫יהושיע‬ ‫ולי‬ ‫דלותי‬
‫עליכי‬ ‫גמל‬ ‫יהוה‬ ‫כי‬ ‫למנוחיכי‬
‫עיני‬ ‫את‬ ‫ממות‬ ‫נפשי‬ ‫חלצת‬ ‫כי‬
‫מדחי‬ ‫רגלי‬ ‫את‬ ‫מדמעה‬
‫בארצות‬ ‫יהוה‬ ‫לפני‬ ‫אתהלך‬
‫אני‬ ‫אדבר‬ ‫כי‬ ‫האמנתי‬ ‫החיים‬
‫אדזבעיכישעידודארוליעחנוניכי‬
‫דסראזנויוביסיארראאוניחבליש‬
‫תומצרישאולצאוניצדוגוןאמצאו‬
‫בשםידוארראאנאידודלטכשינון‬
‫ידודוצדידואדינוסרחסשוערתאי‬
‫סיזוזדלייייליידושיעשובינשילסנ‬
‫וחיכיכיידודגמלעיכיכיחלצתנשי‬
‫ממועאעעיניסדסעדאערגליאעד‬
‫לךלניידודבארדחייפדאסנעיכיא‬
‫דבראניגליאעדל‬
OCR Transcription (the ten unspaced lines above; the lines before them give the correct text)

The word-spotting approach:
Search for the image
of the query word
(and not for its text)

Given one (or more)
image(s)
of a query word,
find all occurrences of
similar images in the
corpus of
manuscript images
[Figure: image of the query word]

2. Extracting Word-Candidates
(“Patches”) From a Manuscript’s Image

3. Patch Normalization
Normalizing every patch into a standard grid
of 8960 pixels (20*7 cells of 8*8 pixels each)
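
A minimal sketch of this normalization step, not the authors' implementation. It assumes a grayscale patch held in a NumPy array, a layout of 20 cells wide by 7 cells high (the slide gives only "20*7"), and nearest-neighbour resampling. The names `normalize_patch` and `split_into_cells` are hypothetical.

```python
import numpy as np

CELL = 8                  # pixels per cell side (from the slide)
CELLS_W, CELLS_H = 20, 7  # ASSUMED orientation: 20 cells wide, 7 cells high


def normalize_patch(patch: np.ndarray) -> np.ndarray:
    """Resample a 2-D grayscale patch to the standard 56 x 160 grid.

    Nearest-neighbour sampling is an assumption; the slides do not say
    which interpolation is used.
    """
    out_h, out_w = CELLS_H * CELL, CELLS_W * CELL
    in_h, in_w = patch.shape
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return patch[np.ix_(rows, cols)]


def split_into_cells(grid: np.ndarray) -> np.ndarray:
    """Return an array of shape (140, 8, 8): one 8x8 block per cell, row by row."""
    h, w = grid.shape
    return (grid.reshape(h // CELL, CELL, w // CELL, CELL)
                .swapaxes(1, 2)
                .reshape(-1, CELL, CELL))


patch = np.arange(30 * 100).reshape(30, 100)   # stand-in for a word image
grid = normalize_patch(patch)
cells = split_into_cells(grid)
print(grid.shape, grid.size, cells.shape)
```

Every patch ends up with the same 8960 pixels regardless of its original size, which is what lets the later steps build fixed-length descriptors.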

4. Image descriptors for every patch
Constructing, for every patch
an image-descriptor vector of
12,460 real numbers
140 cells * (31+58)=12,460
(31 features of HOG vector)
(58 features of LBP vector)
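
The arithmetic above can be checked with a short sketch. The per-cell sizes (31 and 58) come from the slide; the slides do not say which LBP variant is used, so the observation that 58 equals the number of "uniform" 8-neighbour LBP patterns is offered only as a plausible reading, not a confirmed one. The function names are hypothetical, and the feature arrays are placeholders rather than real HOG or LBP output.

```python
import numpy as np

N_CELLS = 20 * 7   # cells per normalized patch (from the slides)
HOG_DIM = 31       # HOG features per cell (from the slides)
LBP_DIM = 58       # LBP features per cell (from the slides)


def descriptor_length() -> int:
    """Total descriptor length: cells times features per cell."""
    return N_CELLS * (HOG_DIM + LBP_DIM)


def count_uniform_lbp_patterns(bits: int = 8) -> int:
    """Count circular bit patterns with at most two 0/1 transitions."""
    count = 0
    for code in range(2 ** bits):
        pattern = [(code >> i) & 1 for i in range(bits)]
        transitions = sum(pattern[i] != pattern[(i + 1) % bits]
                          for i in range(bits))
        if transitions <= 2:
            count += 1
    return count


def assemble_descriptor(hog_cells: np.ndarray, lbp_cells: np.ndarray) -> np.ndarray:
    """Concatenate per-cell HOG (140 x 31) and LBP (140 x 58) rows into one vector."""
    return np.hstack([hog_cells, lbp_cells]).ravel()


# Placeholder feature arrays, used only to show the resulting shape.
vec = assemble_descriptor(np.zeros((N_CELLS, HOG_DIM)), np.zeros((N_CELLS, LBP_DIM)))
print(descriptor_length(), count_uniform_lbp_patterns(), vec.shape)
```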

5. Dimension Reduction
PCA (Principal Component Analysis)
Input: a matrix of M rows (Patch 1, Patch 2, Patch 3, ..., Patch M) by 12,460 columns
Output: a matrix of M rows by 1000 columns
M = total number of patches in all images of the corpus
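
A minimal PCA sketch using NumPy's SVD, shown on toy sizes (200 patches, 50 features, 10 components) instead of the slide's M x 12,460 reduced to M x 1000. The helper names `fit_pca` and `project` are hypothetical, and an exact SVD is an assumption made for illustration; at the corpus sizes mentioned in the slides an incremental or randomized variant would presumably be needed.

```python
import numpy as np


def fit_pca(X: np.ndarray, k: int):
    """Return (mean, components); components has shape (k, D)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def project(X: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Map descriptors (one per row, or a single vector) to the reduced space."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # toy stand-in for the M x 12,460 matrix
mean, components = fit_pca(X, k=10)   # the slides use k = 1000
reduced = project(X, mean, components)
print(reduced.shape)
```

The same `mean` and `components` must be applied to the query descriptor, so that the query and the corpus patches live in the same reduced space.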

6. Similarity Computation
Computing an efficient
similarity measure
between
the query-reduced vector
and
the reduced vectors
of all patches of all
images in the corpus
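[Figure: the query patch, a vector of 1000 values, is compared with the dataset, a matrix of M rows (Patch 1, Patch 2, Patch 3, ..., Patch M) by 1000 columns. The result is a vector of M values; entry i is the similarity of the query patch to patch number i.]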
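
A sketch of this comparison as a single matrix-vector product. The slides do not name the similarity measure, so cosine similarity is used here purely as an assumption; the function names are hypothetical.

```python
import numpy as np


def l2_normalize(V: np.ndarray) -> np.ndarray:
    """Scale each vector (along the last axis) to unit length."""
    norms = np.linalg.norm(V, axis=-1, keepdims=True)
    return V / np.maximum(norms, 1e-12)   # guard against all-zero vectors


def similarities(query: np.ndarray, patches: np.ndarray) -> np.ndarray:
    """Cosine similarity of one query vector (d,) to each row of patches (M, d).

    Entry i of the result is the similarity of the query to patch number i.
    """
    return l2_normalize(patches) @ l2_normalize(query)


# Toy example with d = 2; in the slides d = 1000.
patches = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
query = np.array([3.0, 0.0])
print(similarities(query, patches))
```

If the patch matrix is normalized once, off-line, each query costs only one pass over the M reduced vectors.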

7. Result
Sort the results by decreasing similarity
and display the patches with the best
similarity to the query
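
A sketch of the ranking step, assuming the similarities are already available as a NumPy array with one score per patch. The function name `top_k` and the choice of a partial sort are illustrative assumptions; sorting all M scores would give the same leading results.

```python
import numpy as np


def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, ordered by decreasing score."""
    k = min(k, scores.size)
    if k <= 0:
        return np.empty(0, dtype=int)
    # Partial selection first, then a full sort of only the k survivors.
    best = np.argpartition(scores, -k)[-k:]
    return best[np.argsort(scores[best])[::-1]]


scores = np.array([0.1, 0.9, 0.5, 0.7])
print(top_k(scores, 2))
```

The returned indices identify the patches to display, best match first.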

Two Tests
20 pages, about 5000 words each

                           1. George Washington    2. Lord Byron
                              (Handwritten)           (Printed)
Precision                     50%                     91%
Single query                  0.08 sec                0.03 sec
Pre-processing per page       46 sec                  3 sec

Current Problems
1. Efficiently building (off-line, in terms
of space and time) compact image-
descriptors for all patches from all
(half-a-million) images.
2. Building an efficient (on-line) system
for comparing the query vector to all
(100 million?) patches’ vectors
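
A back-of-the-envelope sketch of the storage these two problems imply. The patch count is the slide's own tentative figure ("100 million?"), and 4 bytes per value (32-bit floats) is an assumption; the function name is hypothetical.

```python
def storage_gb(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage in gigabytes (10**9 bytes) for n_vectors vectors of length dim."""
    return n_vectors * dim * bytes_per_value / 1e9


N_PATCHES = 100_000_000   # tentative figure from the slide

print(storage_gb(N_PATCHES, 12_460))   # full descriptors, before PCA
print(storage_gb(N_PATCHES, 1_000))    # reduced vectors, after PCA
```

Under these assumptions the reduced vectors alone are too large to hold in the memory of a typical single machine, which is one way to read the need for compact descriptors and an efficient on-line comparison.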

When solved and implemented
it will offer
new horizons
to the study of large corpora
of historical documents
