Querying a Large Corpus of
Historical Handwritten Manuscripts
Using Word-Spotting Algorithms
Yaacov Choueka, Adiel ben-Shalom
The Friedberg Genizah Project
Nachum Dershowitz, Lior Wolf, Adi Silberfenig
School of Computer Science, Tel Aviv University
Minerva 2015,
Jerusalem
The Problem: find all occurrences of a
given query-word in all the manuscripts
of the corpus
(arbitrary language, arbitrary script)
Example:
The Cairo Genizah Corpus
360,000 fragments
Hebrew characters, Hebrew and Arabic languages
The query: ‫בראשית‬
Simple Solution: full-text search
KWIC Output
The catch:
The software can search only
manuscripts that have been
transcribed into electronic form!
Usually, however, most of the manuscripts
are never transcribed!
In the Genizah case:
480,000 images are available
only 40,000 (8%) have been transcribed!
OCR
Does not work well
for handwritten historical documents
‫את‬ ‫יהוה‬ ‫ישמע‬ ‫כי‬ ‫אהבתי‬
‫לי‬ ‫אוזנו‬ ‫הטה‬ ‫כי‬ ‫תחנוני‬ ‫קולי‬
‫מות‬ ‫חבלי‬ ‫אפפוני‬ ‫אקרא‬ ‫ובימי‬
‫צרה‬ ‫מצאוני‬ ‫שאול‬ ‫ומצרי‬
‫יהוה‬ ‫ובשם‬ ‫אמצא‬ ‫ויגון‬
‫מלטה‬ ‫יהוה‬ ‫אנה‬ ‫אקרא‬
‫ואלוהינו‬ ‫וצדיק‬ ‫יהוה‬ ‫חנון‬ ‫נפשי‬
‫יהוה‬ ‫פתאים‬ ‫שומר‬ ‫מרחם‬
‫נפשי‬ ‫שובי‬ ‫יהושיע‬ ‫ולי‬ ‫דלותי‬
‫עליכי‬ ‫גמל‬ ‫יהוה‬ ‫כי‬ ‫למנוחיכי‬
‫עיני‬ ‫את‬ ‫ממות‬ ‫נפשי‬ ‫חלצת‬ ‫כי‬
‫מדחי‬ ‫רגלי‬ ‫את‬ ‫מדמעה‬
‫בארצות‬ ‫יהוה‬ ‫לפני‬ ‫אתהלך‬
‫אני‬ ‫אדבר‬ ‫כי‬ ‫האמנתי‬ ‫החיים‬
‫אדזבעיכישעידודארוליעחנוניכי‬
‫דסראזנויוביסיארראאוניחבליש‬
‫תומצרישאולצאוניצדוגוןאמצאו‬
‫בשםידוארראאנאידודלטכשינון‬
‫ידודוצדידואדינוסרחסשוערתאי‬
‫סיזוזדלייייליידושיעשובינשילסנ‬
‫וחיכיכיידודגמלעיכיכיחלצתנשי‬
‫ממועאעעיניסדסעדאערגליאעד‬
‫לךלניידודבארדחייפדאסנעיכיא‬
‫דבראניגליאעדל‬
OCR Transcription
Search for the image
of the query word
(and not for its text)
The word-spotting approach:
Given one (or more)
image(s)
of a query word,
find all occurrences of
similar images in the
corpus collection of
manuscripts’ images
Word-spotting
Query: [image of the query word]
1. Binarization
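The slide names the binarization step without saying which method is used. As an illustration only, the sketch below applies Otsu's global threshold, a common choice for document images; the function names `otsu_threshold` and `binarize` are hypothetical, and the input is assumed to be a greyscale image given as rows of integers in 0..255.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0..255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with grey level <= t
    sum0 = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map dark (ink) pixels to 1 and light (background) pixels to 0."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


page = [[200, 200, 20],
        [200, 20, 200]]
ink = binarize(page)
```

A global threshold is the simplest option; stained or faded manuscripts may need a locally adaptive method instead.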
2. Extracting Word-Candidates
(“Patches”) From a Manuscript’s Image
3. Patch Normalization
Normalizing every patch into a standard grid
of 8960 pixels (20*7 cells of 8*8 pixels each)
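The grid arithmetic can be checked with a small sketch. It assumes the grid is 20 cells wide and 7 cells high (160 x 56 pixels), which the slide does not state, and it uses nearest-neighbour resampling as a stand-in for whatever interpolation the authors use. `normalize_patch` is a hypothetical name.

```python
CELL = 8                   # pixels per cell side (from the slide)
CELLS_W, CELLS_H = 20, 7   # assumed orientation: 20 cells across, 7 down
WIDTH, HEIGHT = CELLS_W * CELL, CELLS_H * CELL   # 160 x 56 pixels


def normalize_patch(patch, width=WIDTH, height=HEIGHT):
    """Resample a 2-D patch (list of rows) to the standard grid size."""
    src_h, src_w = len(patch), len(patch[0])
    return [
        [patch[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


tiny = [[0, 1],
        [1, 0]]
grid = normalize_patch(tiny)
n_pixels = sum(len(row) for row in grid)   # 20 * 7 * 8 * 8
```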
4. Image descriptors for every patch
Constructing, for every patch
an image-descriptor vector of
12,460 real numbers
140 cells * (31+58)=12,460
(31 features of HOG vector)
(58 features of LBP vector)
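The descriptor length follows from concatenating one HOG and one LBP histogram per cell. The sketch below checks that arithmetic. It also shows one common way to arrive at 58 LBP bins, namely the 8-bit "uniform" patterns with at most two circular bit transitions; that the slides use exactly this definition is an assumption, and the per-cell histograms themselves are taken as given. All names are hypothetical.

```python
def circular_transitions(code, bits=8):
    """Count 0/1 changes between neighbouring bits, treating the code as a ring."""
    rotated = (code >> 1) | ((code & 1) << (bits - 1))
    return bin(code ^ rotated).count("1")


# 8-bit patterns with at most two transitions ("uniform" patterns).
UNIFORM_PATTERNS = [c for c in range(256) if circular_transitions(c) <= 2]

HOG_DIMS = 31                       # per-cell HOG length, from the slide
LBP_DIMS = len(UNIFORM_PATTERNS)    # 58 under the uniform-pattern assumption
N_CELLS = 20 * 7                    # 140 cells per normalized patch


def patch_descriptor(hog_cells, lbp_cells):
    """Concatenate per-cell HOG and LBP histograms into one flat vector."""
    vec = []
    for hog, lbp in zip(hog_cells, lbp_cells):
        vec.extend(hog)
        vec.extend(lbp)
    return vec


# Placeholder histograms, only to check the resulting length.
hog_cells = [[0.0] * HOG_DIMS for _ in range(N_CELLS)]
lbp_cells = [[0.0] * LBP_DIMS for _ in range(N_CELLS)]
descriptor = patch_descriptor(hog_cells, lbp_cells)
```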
5. Dimension Reduction
PCA (Principal Component Analysis) reduces the M x 12,460 matrix
of patch descriptors (one row per patch, Patch 1 ... Patch M)
to an M x 1000 matrix
M = Total Number of Patches
In all images of the corpus
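A toy sketch of the PCA step, written with the standard library only so that it runs anywhere. It finds the leading directions by power iteration with deflation and projects a vector onto them. The slides reduce 12,460 dimensions to 1000; the example here reduces 2 to 1 purely for illustration, and a real system would use a numerical library. `pca_fit` and `pca_transform` are hypothetical names.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pca_fit(rows, k, iters=200, seed=0):
    """Return (mean, components): the k leading principal directions of rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            # One power-iteration step: w = X^T (X v), without forming X^T X.
            xv = [dot(row, v) for row in centered]
            w = [sum(xv[i] * centered[i][j] for i in range(n)) for j in range(d)]
            norm = dot(w, w) ** 0.5
            if norm == 0:
                break
            v = [x / norm for x in w]
        components.append(v)
        # Deflation: remove the found direction so the next pass finds the next one.
        centered = [[x - dot(row, v) * vj for x, vj in zip(row, v)]
                    for row in centered]
    return mean, components


def pca_transform(row, mean, components):
    """Project one descriptor onto the learned components."""
    shifted = [x - m for x, m in zip(row, mean)]
    return [dot(shifted, c) for c in components]


# Toy data lying on the line y = 2x, so one component captures everything.
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mean, components = pca_fit(data, k=1)
reduced = pca_transform([1.0, 2.0], mean, components)
```

The same `mean` and `components` learned from the corpus would be applied to the query descriptor, so that query and corpus vectors live in the same reduced space.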
6. Similarity Computation
Computing an efficient
similarity measure
between
the query-reduced vector
and
the reduced vectors
of all patches of all
images in the corpus
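[Figure: the 1000-element query patch vector is compared with every row
of the M x 1000 dataset matrix (Patch 1 ... Patch M), giving a result
vector of M values; entry i is the similarity of the query patch to
patch number i]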
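The slides do not name the similarity measure. The sketch below uses cosine similarity as an assumed stand-in: the query vector is scored against every patch vector, producing one score per patch. `l2_normalize` and `similarities` are hypothetical names, and the 2-element vectors stand in for the 1000-element reduced vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def similarities(query, patches):
    """Cosine similarity of the query vector to each patch vector."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(p))) for p in patches]


patches = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
scores = similarities([1.0, 0.0], patches)
```

If the corpus vectors are normalized once, off-line, each query reduces to a single matrix-vector product over the M x 1000 matrix.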
7. Result
Sort the results by decreasing similarity
and display the patches with the best
similarity to the query
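Ranking can be sketched as follows. Since only the best few matches are displayed, a partial selection with `heapq.nlargest` avoids sorting all M scores. `top_matches` is a hypothetical name, and the scores are made-up illustration values.

```python
import heapq


def top_matches(scores, k):
    """Return the k best (patch_index, score) pairs, highest score first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


scores = [0.2, 0.9, 0.5, 0.7]
best = top_matches(scores, 2)
```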
Two Tests
                          1. George Washington   2. Lord Byron
                             (Handwritten)          (Printed)
Precision                    50%                    91%
Single query                 0.08 sec               0.03 sec
Pre-processing per Page      46 sec                 3 sec
20 pages, about 5000 words each
Current Problems
1. Efficiently building (off-line, in terms
of space and time) compact image
descriptors for all patches from all
(half a million) images.
2. Building an efficient (on-line) system
for comparing the query vector to the
vectors of all (100 million?) patches.
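A back-of-the-envelope estimate shows why both problems are hard. The sketch assumes 4 bytes per value (32-bit floats), which the slides do not specify, and takes the slide's tentative figure of 100 million patches at face value.

```python
def storage_bytes(n_patches, dims, bytes_per_value=4):
    """Raw size of a dense n_patches x dims matrix of fixed-width values."""
    return n_patches * dims * bytes_per_value


N_PATCHES = 100_000_000   # "100 million?" from the slide
N_IMAGES = 500_000        # "half a million" images

full = storage_bytes(N_PATCHES, 12_460)     # descriptors before PCA
reduced = storage_bytes(N_PATCHES, 1_000)   # descriptors after PCA
patches_per_image = N_PATCHES // N_IMAGES

print(f"full descriptors:    {full / 1e12:.3f} TB")
print(f"reduced descriptors: {reduced / 1e9:.0f} GB")
print(f"patches per image:   {patches_per_image}")
```

Under these assumptions even the reduced vectors need about 400 GB, so a brute-force comparison of a query against every patch would not fit in the memory of a single ordinary machine.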
Once these problems are solved and the system is implemented,
it will open
new horizons
for the study of large corpora
of historical documents
Thank You
