Dr. Kareem Darwish's presentation at QITCOM 2011

QITCOM 2011
May 24 | Day 1 | INNOVATE

Session 2: Digitizing Arabic Content - Lead the Way

Speaker: Dr. Kareem Darwish, Arabic Language Technology Senior Scientist - Qatar Computing Research Institute, Qatar Foundation

Topic: E-Learning: The Future of Arabic Digital Content

For more information visit www.qitcom.com.qa

Dr. Kareem Darwish's presentation at QITCOM 2011 Presentation Transcript

  • 1. Digitizing and Retrieving Printed Arabic Documents
    Kareem Darwish
    Senior Scientist
    Qatar Computing Research Institute
  • 2. Overview
    Scanning
    Some Magic
    Search results
  • 3. Scanning
    http://en.wikipedia.org/wiki/Book_scanning
  • 4. Scanning
    http://en.wikipedia.org/wiki/Book_scanning
  • 5. Scanning
    http://www.kirtas.com
  • 6. Scanning
    http://www.kirtas.com
  • 7. Result of Scanning
    http://www.colophon.com
    Courtesy of the Library of Alexandria
  • 8. Magic: Optical Character Recognition
    من ناحيتى المراقبة والنيران على السهل الساحلى.
    وطرق الاقتراب التى تسلكها أى قوات عربية من ناحية الشرتى تنح ر دى طرق
    خمسة أهمها الثلاثة التالية:
    ا- الطريق ا لأول وهو ا لأقصر من بغداد- هـ 2233 " المفرتى.، أو ا لانحراف إلى
    الرطبة قبل 3أ3 دمشق الأردن،
    محاور ا لارابى من العرا إلى سوريا وا لأردد
    2- الطريق الاثانى من بغداد " أبو كمال- بالميرا- دمشق- ا لأردد.
    3- الطريق الثالسث وهو الأطول من بغ داد- الموصل- دير الزور- حملرو- دمشق-
    ا لأردن " 68
    Courtesy of the Library of Alexandria
    OCR output (Sakhr)
  • 9. Arabic OCR is Hard
    Letters change shape depending on position in word, with dots distinguishing them from each other
    تـ ، ـتـ ، ـت
    قـ ، ـقـ ، ـق ، ق
    Diacritics are optional
    ق ، قَ ، قِ ، قُ ، قَّ ، قْ
    Some letter combinations have special shapes (ligatures):
    ل + ا = لا
    Letter elongations (Kashida) are often used
    قبل قبـــــــــــــــــــــــــــل
    Letters are connected
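Two of the properties listed above, optional diacritics and kashida, can be neutralized before matching text. The following is a minimal sketch, assuming the text is Unicode and that stripping the marks in the range U+064B to U+0652 plus the tatweel character U+0640 is acceptable for the task. The function name `strip_marks` is hypothetical and is not taken from the presentation.

```python
import re

# Arabic diacritics from fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")
# Tatweel, the character used to write kashida (letter elongation).
TATWEEL = "\u0640"


def strip_marks(text: str) -> str:
    """Remove optional diacritics and kashida so spelling variants compare equal."""
    return DIACRITICS.sub("", text).replace(TATWEEL, "")


if __name__ == "__main__":
    elongated = "\u0642\u0628\u0640\u0640\u0640\u0644"  # qaf, ba, three tatweels, lam
    plain = "\u0642\u0628\u0644"                         # qaf, ba, lam
    print(strip_marks(elongated) == plain)
```

This only handles the text side of the problem. It does not help an OCR engine decide whether a dot on the page is ink or speckle.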
  • 10. Arabic OCR is Hard
    Diacritics and dots are easily confused. If the manuscript is old,
    they can also be mistaken for speckle on the page
    Word error rate is typically greater than 20%!
  • 11. Arabic OCR is Hard
    Typical OCR output
    وتامسوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدنالذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
  • 12. Arabic Morphology Challenges
    Arabic uses complex derivational morphology:
    Root (ex. ktb)
    Stem – root in a template (ex. mkAtbp)
    Word – stem with optional determiner, preposition, coordinating conjunction, plural suffix, etc. (ex. w+Al+mkAtbp+At → wAlmkAtbAt)
    Estimated number of possible words: 60 billion
    Morphology dictates diacritics, which change meaning
    Ex. Elm → Eelm, Ealam, Eolem (knowledge, flag, acknowledge)
    No specific writing standard is prevalent:
    Ex. The trailing letters in Ely (Ali) and ElY (on) are often interchanged
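The missing writing standard is often handled by folding spelling variants onto one canonical letter before indexing and searching. The sketch below assumes this kind of normalization, which is common practice but is not described on the slide. It maps alef maqsura (Buckwalter Y) to ya (Buckwalter y) and the hamza forms of alef to bare alef. The table `NORMALIZE` and the function `normalize` are hypothetical names.

```python
# Map orthographic variants onto one canonical letter.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura (Buckwalter Y) -> ya (Buckwalter y)
})


def normalize(word: str) -> str:
    """Fold common Arabic spelling variants together."""
    return word.translate(NORMALIZE)


if __name__ == "__main__":
    ely = "\u0639\u0644\u064A"  # Ely (Ali)
    elY = "\u0639\u0644\u0649"  # ElY (on)
    print(normalize(ely) == normalize(elY))
```

The trade-off is deliberate: Ely and ElY become indistinguishable, which loses a real distinction but makes matching robust to the interchange described above.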
  • 13. Arabic Morphology
    For regular Arabic search, morphological analysis is typically used:
    Full morphological analysis:
    Sebawai, Buckwalter, IBM Lee, AMIRA
    Light stemming – remove common prefixes and suffixes
    Al-Stem or Light-10
    On OCR output, these techniques fail
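A toy light stemmer makes the failure on OCR text concrete. This is a sketch only: the affix lists are hypothetical and much shorter than those of Al-Stem or Light-10, and words are written in Buckwalter transliteration as on the morphology slide.

```python
# Hypothetical affix lists in Buckwalter transliteration; real light stemmers
# such as Al-Stem or Light-10 use their own, longer lists.
PREFIXES = ["wAl", "fAl", "bAl", "Al", "w"]  # longer prefixes listed first
SUFFIXES = ["At", "wn", "yn", "p"]
MIN_STEM = 3  # never strip an affix if fewer than 3 letters would remain


def light_stem(word: str) -> str:
    """Strip at most one common prefix and one common suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["wAlmkAtbAt", "mkAtbp", "wAlfjAr", "rAlfjAr"]:
        print(w, light_stem(w))
```

Clean text conflates as intended: wAlmkAtbAt and mkAtbp both reduce to mkAtb. The last input, rAlfjAr, is the transliteration of an OCR error in which the leading w was read as r. The prefix wAl is no longer recognizable, so nothing is stripped and the word no longer matches fjAr.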
  • 14. OCR Error Handling
    Error correction:
    Word level techniques:
    Dictionary lookup (Jurafsky & Martin, 2000)
    Character level model uses confusion matrix
    Typically font dependent
    Character n-gram model:
    Some character sequences are more common than others
    Presence of a rare character sequence indicates position of error
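    $\arg\max_{Word_{Org}} P(Word_{Org} \mid Word_{OCR}) = \arg\max_{Word_{Org}} P(Word_{OCR} \mid Word_{Org}) \, P(Word_{Org})$
    where $P(Word_{OCR} \mid Word_{Org})$ is the character-level model and $P(Word_{Org})$ is the word-level model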
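The combination of a character-level model and a word-level model can be sketched as a small noisy-channel corrector. Everything numeric here is an assumption made for illustration: the lexicon, its probabilities, and the confusion table are invented, and the channel model only handles one-to-one character substitutions, so candidates of a different length are ruled out.

```python
import math

# Hypothetical word-level model: unigram probabilities of dictionary words.
WORD_PROB = {"kennedy": 0.0004, "kenneth": 0.0006, "remedy": 0.0002}

# Hypothetical character-level model: P(observed char | true char),
# keyed as (true, observed). A real table would be estimated per font.
CONFUSION = {("e", "o"): 0.05, ("m", "n"): 0.03, ("c", "e"): 0.04}
P_CORRECT = 0.9   # probability that a character is recognized correctly
P_OTHER = 0.001   # probability of any substitution not listed above


def channel_logprob(observed: str, original: str) -> float:
    """log P(observed | original) under a substitution-only model."""
    if len(observed) != len(original):
        return -math.inf
    total = 0.0
    for true_ch, obs_ch in zip(original, observed):
        if true_ch == obs_ch:
            p = P_CORRECT
        else:
            p = CONFUSION.get((true_ch, obs_ch), P_OTHER)
        total += math.log(p)
    return total


def correct(observed: str) -> str:
    """Return the dictionary word maximizing P(observed | word) * P(word)."""
    return max(
        WORD_PROB,
        key=lambda w: channel_logprob(observed, w) + math.log(WORD_PROB[w]),
    )


if __name__ == "__main__":
    print(correct("kennody"))
```

With these numbers, "kennody" is corrected to "kennedy" even though "kenneth" has the higher prior, because the channel model heavily penalizes the two unlisted substitutions that "kenneth" would require. If no candidate has the same length as the input, this sketch simply returns the first dictionary entry, which a real system would need to handle.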
  • 15. OCR Error Handling
    Error correction:
    Passage level/context sensitive techniques:
    Using language modeling (bigram or trigram LM)
    Clustering words in passage:
    assumes salient terms appear more than once
    Ex. Kennedy; Kemedy; Kennody; etc.
    $P(Word_{Org} \mid Word_{OCR}) \propto P(Word_{OCR} \mid Word_{Org}) \, P(Word_{Org}) \, P(Word_{Org} \mid Word_{Org-1})$
    where $Word_{Org-1}$ is the preceding word
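The clustering idea can be sketched with plain edit distance: within one passage, rarer spellings are attached to a more frequent spelling that is only a few edits away. The distance threshold of 2 and the greedy, frequency-ordered assignment are assumptions of this sketch, not details given in the presentation.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cluster_variants(words, max_dist=2):
    """Map each word to the most frequent passage word within max_dist edits."""
    heads = []     # cluster representatives, most frequent first
    mapping = {}
    for word, _count in Counter(words).most_common():
        for head in heads:
            if edit_distance(word, head) <= max_dist:
                mapping[word] = head
                break
        else:
            heads.append(word)
            mapping[word] = word
    return mapping


if __name__ == "__main__":
    passage = "kennedy spoke kennedy kemedy kennody kennedy".split()
    print(cluster_variants(passage))
```

A fixed threshold is risky for short words, where two edits can turn one valid word into another, so a practical version would likely scale the threshold with word length.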
  • 16. OCR Error Handling
    Multi-source fusion:
    Uses language modeling to fuse the output of multiple OCR systems
    Query garbling:
    Use a character level model to generate multiple degraded versions of a query
    Ex.: cement => cement, cornent, cernont, etc.
    Set degraded versions of a term as synonyms
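Query garbling can be sketched as expanding each query character into itself plus its likely OCR confusions, then taking every combination. The confusion table below is hypothetical and was chosen only so that the slide's own example variants appear. The function name `garble` and the cap on the number of variants are also assumptions.

```python
from itertools import product

# Hypothetical character-level confusions: true character -> OCR renderings.
CONFUSIONS = {"m": ["rn"], "e": ["o"]}


def garble(term: str, max_variants: int = 50):
    """Generate degraded versions of a query term, original spelling first."""
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in term]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= max_variants:
            break
    return variants


if __name__ == "__main__":
    # The variants would then be treated as synonyms of the original term,
    # for example by OR-ing them together in the search query.
    print(garble("cement"))
```

The number of variants grows exponentially with the number of confusable characters, which is why the sketch caps the output. A real system would more likely keep only the most probable variants under the character-level model.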
  • 17. Arabic OCR Text Retrieval
    Without error handling → use character n-grams (3- and 4-grams)
    وتام سوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدن الذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
    رالفجار والفجار
    رال ، الف ، فجا ، جار
    وال ، الف ، فجا ، جار
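The effect of indexing character n-grams instead of whole words can be checked on the word pair above. This is a minimal sketch: the helper names `char_ngrams` and `overlap` are hypothetical, and Jaccard overlap is used here only as a simple way to quantify how many n-grams survive the OCR error. It is not claimed to be the ranking function of any particular retrieval system.

```python
def char_ngrams(word: str, sizes=(3, 4)) -> set:
    """Return the set of all character n-grams of the given sizes."""
    return {word[i:i + n] for n in sizes for i in range(len(word) - n + 1)}


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the n-gram sets of two words."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


if __name__ == "__main__":
    ocr_word = "رالفجار"    # OCR output with the first letter misrecognized
    query_word = "والفجار"  # the intended word
    print(ocr_word == query_word)
    print(sorted(char_ngrams(ocr_word) & char_ngrams(query_word)))
    print(overlap(ocr_word, query_word))
```

Whole-word matching fails because the two strings differ, but only the n-grams that include the damaged first letter are lost. The remaining shared 3-grams and 4-grams still let the query reach the document.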
  • 18. Presenting Results
    Presenting OCR output to users is not an option
    What would a ranked list of images look like?
    How would we generate image snippets?
    How do we highlight salient terms in these images?
  • 19. Presenting Results
    What is the unit of search?
    Is it a book, a chapter, or a page?
  • 20. Concluding Remarks
    Scanning is a fairly mature technology
    Arabic OCR still has a long way to go
    Quality of search is tied to the quality of OCR
    Presentation issues persist