Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages
Sawood Alam
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Fateh ud din B Mehmood
National University of Sciences and Technology
Islamabad, Pakistan
Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
The Time Travel
OK Google, Define Dictionary
a book or electronic resource that lists the words of a language (typically in alphabetical order) and gives their meaning, or gives the equivalent words in a different language, often also providing information about pronunciation, origin, and usage.
Dictionaries Are Different
Read: random access
Write: maintain sort order
The most compact mode to preserve a language
Problem: English Dictionary
Johnson's English dictionary
Problem: Urdu Dictionary
Farhang-e-Asifiyah
Related Work
Unicode Collation
Ordered assembly of written information
Unicode values != natural collation
Arabic script: U+0600 to U+06FF
Out of order alphabets in derived languages
Common Locale Data Repository (CLDR)
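The mismatch between code point order and alphabet order can be seen with a minimal sketch. It uses only a four-letter excerpt of the Urdu alphabet (alef, be, pe, te); the names URDU_EXCERPT, RANK, and collation_key are hypothetical and chosen for illustration. A real system would more likely rely on CLDR collation data than on a hand-written table.

```python
# Urdu letters alef, be, pe, te in their natural alphabet order.
# Only a four-letter excerpt is used here, for illustration.
URDU_EXCERPT = ["\u0627", "\u0628", "\u067E", "\u062A"]

# Plain sorting compares Unicode code points, so pe (U+067E),
# a letter added for derived languages, lands after te (U+062A).
by_code_point = sorted(URDU_EXCERPT)

# A hand-written rank table restores the alphabet order (hypothetical helper).
RANK = {letter: i for i, letter in enumerate(URDU_EXCERPT)}


def collation_key(word):
    """Map each letter to its alphabet rank; unknown characters sort last."""
    return [(RANK.get(ch, len(RANK)), ch) for ch in word]


by_alphabet = sorted(by_code_point, key=collation_key)

print([f"U+{ord(ch):04X}" for ch in by_code_point])
print([f"U+{ord(ch):04X}" for ch in by_alphabet])
```

The first printed list follows code point order, the second follows the alphabet excerpt.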
Collation Discrepancies
Compound letters
Diacritical marks
Half letters
Prefixes
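Of the discrepancies listed above, diacritical marks are the easiest to illustrate. The sketch below assumes that a lookup should ignore optional vowel marks, so it drops every character in Unicode category Mn (nonspacing mark) before comparison. The function name strip_marks is hypothetical, and whether marks should be ignored entirely or used as a tie-breaker depends on the dictionary.

```python
import unicodedata


def strip_marks(text):
    """Drop nonspacing combining marks (category Mn), such as Arabic vowel signs.

    The text is not decomposed first, so precomposed letters stay intact.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# The Urdu word "kitab" written with a kasra (U+0650) under the first letter.
marked = "\u06A9\u0650\u062A\u0627\u0628"
plain = "\u06A9\u062A\u0627\u0628"

print(strip_marks(marked) == plain)
```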
Nested Ordering
Root word sorting (Arabic)
Morphological derivation
Derived word simplification
Radicals and strokes (Chinese)
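Root word sorting can be sketched as a two-level sort key: entries are grouped under their root first and ordered within the root second. The entries below are transliterated Arabic words paired with their roots (k-t-b and d-r-s); the data layout and the name nested_key are assumptions made for illustration, not the structure of any particular dictionary.

```python
# (headword, root) pairs in transliteration; illustrative data only.
ENTRIES = [
    ("maktab", "ktb"),
    ("darasa", "drs"),
    ("kitab", "ktb"),
    ("madrasa", "drs"),
    ("katib", "ktb"),
]


def nested_key(entry):
    """Sort by root first, then by the derived word within that root."""
    word, root = entry
    return (root, word)


flat_order = sorted(word for word, _ in ENTRIES)
nested_order = [word for word, _ in sorted(ENTRIES, key=nested_key)]

print(flat_order)
print(nested_order)
```

In the nested order, "madrasa" appears next to "darasa" rather than next to "maktab", which is why a reader must know the root before looking up a derived word.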
Indexing: Ordered Pages
Indexing: Sparse Index
Indexing: Full Index
Indexing: Location Index
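The index levels named above can be sketched as progressively more precise lookups. The sketch makes three assumptions: a sparse index records the first headword of only some pages, a full index maps every headword to its page, and a location index adds a bounding box on the page image. All names and data are hypothetical, and English words are used so that default string comparison stands in for a proper collation key.

```python
from bisect import bisect_right

# Hypothetical sparse index: (first headword on a page, page number), sorted.
SPARSE_INDEX = [("aardvark", 1), ("cat", 40), ("dog", 75), ("mango", 120)]

# Hypothetical full and location indexes for a single headword.
FULL_INDEX = {"cow": 52}
LOCATION_INDEX = {"cow": (52, (120, 340, 200, 28))}  # page, (x, y, width, height)


def estimate_page(word, index=SPARSE_INDEX):
    """Return the last indexed page whose first headword is <= word."""
    keys = [head for head, _ in index]
    pos = bisect_right(keys, word) - 1
    return index[max(pos, 0)][1]


def lookup(word):
    """Use the most precise index available, falling back to coarser ones."""
    if word in LOCATION_INDEX:
        page, box = LOCATION_INDEX[word]
        return {"level": "location", "page": page, "box": box}
    if word in FULL_INDEX:
        return {"level": "full", "page": FULL_INDEX[word]}
    return {"level": "sparse", "page": estimate_page(word)}


print(lookup("cow"))
print(lookup("cup"))
```

With only the sparse index, the reader is dropped at the nearest indexed page at or before the word and browses forward from there. With ordered pages and no index at all, the same binary search would have to be carried out by the reader flipping through page images.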
Indexing State Transition
Annotation
Digitization
Dictionary Explorer
Multilingual Multi-dictionary Lookup
Searching and Exploring
Annotation and digitization
User Contribution and Feedback
Open Source => GitHub:/urduweb/DictionaryExplorer
Dictionary Explorer: English
Dictionary Explorer: Urdu
Indexing Time
Dictionary                Pages   Index   Mode               Time
English to Urdu           180     Sparse  Manual and Script  10 minutes
Monolingual Urdu          2,500   Sparse  Manual             2 hours
Monolingual Classic Urdu  3,200   Full*   Crowdsource**      60 days
* 75,000 words, phrases, proverbs, and idioms
** 13 contributors
Prefix Permutations
Prefix: One
Prefix: Two
Prefix: Three
Prefix: Four
Prefix: Five
Prefix: Six
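One way to reason about prefix length is to count how many headwords share each prefix: short prefixes produce few, crowded buckets, while longer ones narrow the candidates. The sketch below is only an illustration of that idea on a toy word list; it is not the evaluation behind the slides, and prefix_stats is a hypothetical helper.

```python
from collections import Counter


def prefix_stats(words, max_len=6):
    """For each prefix length, return (distinct prefixes, largest bucket size)."""
    stats = {}
    for n in range(1, max_len + 1):
        # Words shorter than n contribute their whole spelling as the prefix.
        buckets = Counter(word[:n] for word in words)
        stats[n] = (len(buckets), max(buckets.values()))
    return stats


WORDS = ["cat", "car", "cart", "dog", "door", "dot"]  # toy data

for length, (distinct, largest) in prefix_stats(WORDS).items():
    print(length, distinct, largest)
```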
Conclusions and Future Work
Identified issues
Too many matches
Lack of fielded searching
Lack of OCR support
No input method assistance
Collation challenges
Accessibility levels: Ordered Pages, Sparse, Full, and Location indexes, annotation, and digitization
Implemented a multi-lingual multi-dictionary explorer
Effort and prefix evaluation
In future: elastic index and automatic region estimation
GitHub:/urduweb/DictionaryExplorer
Sawood Alam
@ibnesayeed
