Dr. Kareem Darwish's presentation at QITCOM 2011


May 24 | Day 1 | INNOVATE

Session 2: Digitizing Arabic Content - Lead the Way

Speaker: Dr. Kareem Darwish, Arabic Language Technology Senior Scientist - Qatar Computing Research Institute, Qatar Foundation

Topic: E-Learning: The Future of Arabic Digital Content

For more information visit www.qitcom.com.qa

Published in: Technology

  1. Digitizing and Retrieving Printed Arabic Documents
     Kareem Darwish
     Senior Scientist
     Qatar Computing Research Institute
  2. Overview
     Scanning
     Some Magic
     Search results
  3. Scanning
     http://en.wikipedia.org/wiki/Book_scanning
  4. Scanning
     http://en.wikipedia.org/wiki/Book_scanning
  5. Scanning
     http://www.kirtas.com
  6. Scanning
     http://www.kirtas.com
  7. Result of Scanning
     http://www.colophon.com
     Courtesy of the Library of Alexandria
  8. Magic: Optical Character Recognition
     Courtesy of the Library of Alexandria
     OCR output (Sakhr):
     من ناحيتى المراقبة والنيران على السهل الساحلى.
     وطرق الاقتراب التى تسلكها أى قوات عربية من ناحية الشرتى تنح ر دى طرق
     خمسة أهمها الثلاثة التالية:
     ا- الطريق ا لأول وهو ا لأقصر من بغداد- هـ 2233 " المفرتى.، أو ا لانحراف إلى
     الرطبة قبل 3أ3 دمشق الأردن،
     محاور ا لارابى من العرا إلى سوريا وا لأردد
     2- الطريق الاثانى من بغداد " أبو كمال- بالميرا- دمشق- ا لأردد.
     3- الطريق الثالسث وهو الأطول من بغ داد- الموصل- دير الزور- حملرو- دمشق-
     ا لأردن " 68
  9. Arabic OCR is Hard
     • Letters change shape depending on their position in the word, with dots distinguishing them from each other:
       تـ ، ـتـ ، ـت
       قـ ، ـقـ ، ـق ، ق
     • Diacritics are optional:
       ق ، قَ ، قِ ، قُ ، قَّ ، قْ
     • Some letter combinations have special shapes (ligatures):
       ل + ا = لا
     • Letter elongations (kashida) are often used:
       قبل قبـــــــــــــــــــــــــــل
     • Letters are connected
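The optional diacritics and kashida elongations listed on slide 9 are separate Unicode characters, so a text pipeline can fold them away before comparing words. The following is a minimal sketch, not something shown in the talk; the function name `strip_marks` is hypothetical, and it assumes the input is ordinary Unicode Arabic text.

```python
import unicodedata

TATWEEL = "\u0640"  # the kashida (elongation) character


def strip_marks(text: str) -> str:
    """Drop kashida and nonspacing marks (Arabic diacritics such as fatha or shadda).

    Assumption: removing every character in Unicode category "Mn" is acceptable
    for this illustration. Base letters, including precomposed hamza forms, are kept.
    """
    return "".join(
        ch
        for ch in text
        if ch != TATWEEL and unicodedata.category(ch) != "Mn"
    )


if __name__ == "__main__":
    # An elongated and a diacritized spelling both reduce to plain letters.
    print(strip_marks("قبـــــل") == "قبل")
    print(strip_marks("قَ") == "ق")
```

Folding of this kind only addresses spelling variation in clean text; it does not help when OCR misreads a dot or a diacritic as a different letter.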
  10. Arabic OCR is Hard
      • Diacritics and dots are easily confusable. If the manuscript is old, they can be confused with speckle on the page
      • Word error rate is typically greater than 20%!
  11. Arabic OCR is Hard
      Typical OCR output:
      وتامسوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدنالذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
  12. Arabic Morphology Challenges
      • Arabic uses complex derivational morphology:
        Root (ex. ktb)
        Stem – root in a template (ex. mkAtbp)
        Word – stem with optional determiner, preposition, coordinating conjunctions, plural suffix, etc. (ex. w+Al+mkAtbp+At → wAlmkAtbAt)
      • Estimated number of possible words: 60 billion
      • Morphology dictates diacritics, which change meaning
        Ex. Elm → Eelm, Ealam, Eolem (knowledge, flag, acknowledge)
      • No specific writing standard is prevalent
        Ex. The trailing letters in Ely (Ali) and ElY (on) are often interchanged
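The last point on slide 12 (interchanged trailing letters, as in Ely and ElY) is often handled by folding letter variants into one form before indexing and querying. The sketch below illustrates that idea only. The mapping table is an assumption chosen for this example and is not taken from the talk; the slide examples are written in Buckwalter transliteration, where `y` is yeh and `Y` is alef maqsura.

```python
# Hypothetical normalization table (an assumption for illustration only).
NORMALIZE = str.maketrans({
    "\u0649": "\u064a",  # alef maqsura (Buckwalter Y) -> yeh (Buckwalter y)
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
})


def normalize_letters(word: str) -> str:
    """Fold commonly interchanged letter variants into a single form."""
    return word.translate(NORMALIZE)


if __name__ == "__main__":
    ely = "\u0639\u0644\u064a"  # Ely (Ali)
    elY = "\u0639\u0644\u0649"  # ElY (on)
    # After folding, both spellings map to the same index term.
    print(normalize_letters(ely) == normalize_letters(elY))
```

The trade-off is that the two words become indistinguishable in the index, which is usually accepted because writers interchange them anyway.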
  13. Arabic Morphology
      • For regular Arabic search, morphological analysis is typically used:
        Full morphological analysis: Sebawai, Buckwalter, IBM Lee, AMIRA
        Light stemming (remove common prefixes and suffixes): Al-Stem or Light-10
      • On OCR output, these approaches fail
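Light stemming, and the reason it breaks on OCR output, can be sketched in a few lines. The affix lists below are illustrative assumptions written in Buckwalter transliteration; they are not the actual Al-Stem or Light-10 lists, and `light_stem` is a hypothetical name.

```python
# Illustrative affix lists (assumptions, not the Al-Stem or Light-10 lists).
PREFIXES = ("wAl", "bAl", "kAl", "fAl", "Al", "ll", "w")
SUFFIXES = ("At", "An", "wn", "yn", "hA", "yp", "yh", "p", "h", "y")
MIN_STEM = 3  # never strip an affix if fewer than 3 letters would remain


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, trying longer affixes first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[: -len(suffix)]
            break
    return word


if __name__ == "__main__":
    # Clean text: the inflected word and the bare stem conflate.
    print(light_stem("wAlmkAtbAt"), light_stem("mkAtbp"))
    # OCR text: one misread letter (w read as r) hides the prefix,
    # so the damaged word no longer matches the clean one.
    print(light_stem("wAlfjAr"), light_stem("rAlfjAr"))
```

The second pair mirrors the misrecognized word in the OCR sample from the talk: a single character error at the start of the word prevents the prefix from being recognized, so stemming yields different index terms for the clean and the degraded form.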
  14. OCR Error Handling
      Error correction, word-level techniques:
      • Dictionary lookup (Jurafsky & Martin, 2000)
      • Character-level model that uses a confusion matrix (typically font dependent)
      • Character n-gram model:
        Some character sequences are more common than others
        The presence of a rare character sequence indicates the position of an error
      $\arg\max_{Word_{Org}} P(Word_{Org} \mid Word_{OCR}) = \arg\max_{Word_{Org}} P(Word_{OCR} \mid Word_{Org}) \, P(Word_{Org})$
      where $P(Word_{OCR} \mid Word_{Org})$ is the character-level model and $P(Word_{Org})$ is the word-level model
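The word-level correction formula on slide 14 can be illustrated with a toy example. This is a minimal sketch under strong assumptions: the confusion probabilities and word probabilities are invented for illustration, and the channel model handles only one-to-one character substitutions, so candidates of a different length are ruled out.

```python
import math

# Toy character-level model: P(OCR char | original char). Invented values.
CONFUSION = {("e", "o"): 0.05, ("e", "c"): 0.05, ("l", "i"): 0.05}  # (original, ocr)
P_CORRECT = 0.9   # assumed probability that a character is read correctly
P_OTHER = 0.001   # assumed probability of any unlisted substitution

# Toy word-level model: P(original word). Invented values.
WORD_PROB = {"cement": 0.6, "comment": 0.3, "cornet": 0.1}


def channel_logprob(ocr: str, org: str) -> float:
    """log P(WordOCR | WordOrg) under a substitution-only character model."""
    if len(ocr) != len(org):
        return float("-inf")  # simplification: no insertions or deletions
    total = 0.0
    for ocr_char, org_char in zip(ocr, org):
        if ocr_char == org_char:
            total += math.log(P_CORRECT)
        else:
            total += math.log(CONFUSION.get((org_char, ocr_char), P_OTHER))
    return total


def correct(ocr: str) -> str:
    """Return the dictionary word maximizing P(WordOCR | WordOrg) * P(WordOrg)."""
    return max(
        WORD_PROB,
        key=lambda org: channel_logprob(ocr, org) + math.log(WORD_PROB[org]),
    )


if __name__ == "__main__":
    print(correct("cemont"))
```

A realistic confusion matrix would be estimated from aligned OCR and ground-truth text, which is why the slide notes that such models are typically font dependent.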
  15. OCR Error Handling
      Error correction, passage-level (context-sensitive) techniques:
      • Using language modeling (bigram or trigram LM):
        $P(Word_{Org} \mid Word_{OCR}) \propto P(Word_{OCR} \mid Word_{Org}) \, P(Word_{Org}) \, P(Word_{Org} \mid Word_{Org-1})$
      • Clustering words in the passage: assumes salient terms appear more than once
        Ex. Kennedy; Kemedy; Kennody; etc.
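The clustering idea on slide 15 can be sketched as follows: within one passage, a rare token that is a small edit distance away from a more frequent token is treated as a degraded copy of it. The thresholds `max_dist` and `min_len` and the function names are assumptions made for this illustration, not values from the talk.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]


def cluster_corrections(tokens, max_dist=2, min_len=5):
    """Map each token to the most frequent similar token in the same passage."""
    counts = Counter(tokens)
    mapping = {}
    for word in counts:
        best = word
        for other, freq in counts.items():
            if (freq > counts[best]
                    and len(word) >= min_len
                    and edit_distance(word, other) <= max_dist):
                best = other
        mapping[word] = best
    return [mapping[token] for token in tokens]


if __name__ == "__main__":
    passage = ["Kennedy", "spoke", "Kemedy", "Kennedy", "Kennody", "Kennedy"]
    print(cluster_corrections(passage))
```

The length floor keeps short words, which have many close neighbours, from being merged by accident.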
  16. OCR Error Handling
      • Multi-source fusion: uses language modeling to fuse the output of multiple OCR systems
      • Query garbling:
        Use a character-level model to generate multiple degraded versions of a query
        Ex. cement => cement, cornent, cernont, etc.
        Set the degraded versions of a term as synonyms
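Query garbling, as described on slide 16, can be sketched by applying likely OCR confusions to the query term and searching for all resulting variants as synonyms. The confusion rules below are assumptions chosen so that the slide's own example can be reproduced; a real system would draw them from a trained character-level model.

```python
# Hypothetical confusion rules: original character -> possible OCR readings.
CONFUSIONS = {"m": ["rn"], "e": ["o"]}


def garble(term: str, max_edits: int = 2) -> set:
    """Generate degraded versions of a term with up to max_edits confusions."""
    variants = {term}
    frontier = {term}
    for _ in range(max_edits):
        next_frontier = set()
        for word in frontier:
            for i, ch in enumerate(word):
                for replacement in CONFUSIONS.get(ch, []):
                    next_frontier.add(word[:i] + replacement + word[i + 1:])
        next_frontier -= variants   # keep only variants not seen before
        variants |= next_frontier
        frontier = next_frontier
    return variants


if __name__ == "__main__":
    # The variants would be issued together as synonyms of the query term.
    print(sorted(garble("cement")))
```

Unlike error correction, this leaves the OCR text untouched and moves the cost to query time, at the price of a larger query.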
  17. Arabic OCR Text Retrieval
      Without error handling → use character n-grams (3- and 4-grams)
      وتام سوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدن الذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
      OCR output: رالفجار
      Intended word: والفجار
      n-grams of the OCR output: رال ، الف ، فجا ، جار
      n-grams of the intended word: وال ، الف ، فجا ، جار
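The n-gram comparison on slide 17 can be reproduced with a short sketch: both words are broken into overlapping character 3-grams and 4-grams, and most of those grams survive the single OCR error. The overlap score used here (Jaccard similarity) is an assumption for illustration; the talk does not say how matches are scored.

```python
def char_ngrams(word: str, n: int) -> set:
    """All overlapping character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_overlap(a: str, b: str, sizes=(3, 4)) -> float:
    """Jaccard similarity between the combined n-gram sets of two words."""
    grams_a = set().union(*(char_ngrams(a, n) for n in sizes))
    grams_b = set().union(*(char_ngrams(b, n) for n in sizes))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


OCR_WORD = "رالفجار"       # word as it appears in the OCR output
INTENDED_WORD = "والفجار"  # word the page actually contains

if __name__ == "__main__":
    shared = char_ngrams(OCR_WORD, 3) & char_ngrams(INTENDED_WORD, 3)
    print(len(shared), "shared 3-grams")
    print(ngram_overlap(OCR_WORD, INTENDED_WORD))
```

Because only the grams that touch the misread first letter differ, a query for the intended word still matches most index terms of the degraded word without any explicit correction step.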
  18. Presenting Results
      • Presenting OCR output to users is not an option
      • What would a ranked list of images look like?
      • How would we generate image snippets?
      • How do we highlight salient terms in these images?
  19. Presenting Results
      • What is the unit of search?
      • Is it a book, a chapter, or a page?
  20. Concluding Remarks
      • Scanning is a fairly mature technology
      • Arabic OCR still has a long way to go
      • Quality of search is tied to the quality of OCR
      • Presentation issues persist