Dr. Kareem Darwish's presentation at QITCOM 2011

QITCOM 2011
May 24 | Day 1 | INNOVATE

Session 2: Digitizing Arabic Content - Lead the Way

Speaker: Dr. Kareem Darwish, Arabic Language Technology Senior Scientist - Qatar Computing Research Institute, Qatar Foundation

Topic: E-Learning: The Future of Arabic Digital Content

For more information visit www.qitcom.com.qa

    Presentation Transcript

    • Digitizing and Retrieving Printed Arabic Documents
      Kareem Darwish
      Senior Scientist
      Qatar Computing Research Institute
    • Overview
      Scanning
      Some Magic
      Search results
    • Scanning
      http://en.wikipedia.org/wiki/Book_scanning
    • Scanning
      http://www.kirtas.com
    • Result of Scanning
      http://www.colophon.com
      Courtesy of the Library of Alexandria
    • Magic: Optical Character Recognition
      من ناحيتى المراقبة والنيران على السهل الساحلى.
      وطرق الاقتراب التى تسلكها أى قوات عربية من ناحية الشرتى تنح ر دى طرق
      خمسة أهمها الثلاثة التالية:
      ا- الطريق ا لأول وهو ا لأقصر من بغداد- هـ 2233 " المفرتى.، أو ا لانحراف إلى
      الرطبة قبل 3أ3 دمشق الأردن،
      محاور ا لارابى من العرا إلى سوريا وا لأردد
      2- الطريق الاثانى من بغداد " أبو كمال- بالميرا- دمشق- ا لأردد.
      3- الطريق الثالسث وهو الأطول من بغ داد- الموصل- دير الزور- حملرو- دمشق-
      ا لأردن " 68
      Courtesy of the Library of Alexandria
      OCR output (Sakhr)
    • Arabic OCR is Hard
      Letters change shape depending on position in word, with dots distinguishing them from each other
      تـ ، ـتـ ، ـت
      قـ ، ـقـ ، ـق ، ق
      Diacritics are optional
      ق ، قَ ، قِ ، قُ ، قَّ ، قْ
      Some letter combinations have special shapes (ligatures):
      ل + ا = لا
      Letter elongations (Kashida) are often used
      قبل قبـــــــــــــــــــــــــــل
      Letters are connected
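
Because diacritics and kashida are optional, a common preprocessing step is to strip them before comparing or indexing text. The following is a minimal sketch, not something shown in the talk; the function name `strip_optional_marks` is made up for illustration, and the code assumes that the kashida is encoded as U+0640 and the diacritics fall in the Unicode range U+064B to U+0652.

```python
import re

KASHIDA = "\u0640"  # tatweel, the elongation character
# Fathatan through sukun (includes shadda); assumed to cover the
# optional diacritics listed on the slide.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_optional_marks(text: str) -> str:
    """Remove kashida and diacritics so that variant spellings compare equal."""
    return DIACRITICS.sub("", text.replace(KASHIDA, ""))


# An elongated word and its plain form normalize to the same string.
elongated = "\u0642\u0628\u0640\u0640\u0640\u0644"  # qaf, ba, kashida x3, lam
plain = "\u0642\u0628\u0644"                        # qaf, ba, lam
print(strip_optional_marks(elongated) == plain)
```

This kind of normalization only addresses optional marks; it does nothing about positional letter shapes or ligatures, which are a recognition problem rather than a text-encoding one.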
    • Arabic OCR is Hard
      Diacritics and dots are easily confused with each other. If the manuscript is old,
      they can also be confused with speckle on the page
      Word error rate is typically greater than 20%
    • Arabic OCR is Hard
      Typical OCR output
      وتامسوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدنالذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
    • Arabic Morphology Challenges
      Arabic uses complex derivational morphology:
      Root (e.g. ktb)
      Stem – root in a template (e.g. mkAtbp)
      Word – stem with optional determiner, preposition, coordinating conjunctions, plural suffix, etc. (e.g. w+Al+mkAtbp+At → wAlmkAtbAt)
      Estimated number of possible words: 60 billion
      Morphology dictates diacritics, which change meaning
      Ex. Elm  (Eelm, Ealam, Eolem: Knowledge, flag, acknowledge)
      No specific writing standard is prevalent:
      Ex. The trailing letters in Ely (Ali) and ElY (on) are often interchanged
    • Arabic Morphology
      For regular Arabic search, morphological analysis is typically used:
      Full morphological analysis:
      Sebawai, Buckwalter, IBM Lee, AMIRA
      Light stemming – remove common prefixes and suffixes
      Al-Stem or Light-10
      On OCR output, both approaches fail
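
To make the light-stemming idea concrete, here is a rough sketch operating on the Buckwalter-style transliteration used in the slides. The affix lists are illustrative assumptions only; they are not the actual lists used by Al-Stem or Light-10, and `light_stem` is a hypothetical name.

```python
# Illustrative affix lists in Buckwalter transliteration.
# These are assumptions, NOT the real Al-Stem or Light-10 lists.
PREFIXES = ["wAl", "bAl", "kAl", "fAl", "Al", "w"]  # longest first
SUFFIXES = ["At", "wn", "yn", "hA", "p"]
MIN_STEM = 3  # assumed minimum number of characters left after stripping


def light_stem(word: str) -> str:
    """Strip at most one listed prefix and one listed suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("wAlmkAtbAt"))  # clean text
print(light_stem("AlmkAtbp"))    # clean text, different surface form
print(light_stem("rAlmkAtbAt"))  # simulated OCR error: leading w read as r
```

With these lists, the two clean forms reduce to the same stem, while the simulated OCR error (a leading w misread as r, the same kind of confusion visible in the OCR sample earlier) blocks prefix removal and yields a different stem. That is one way to see why affix-based methods break down on OCR text.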
    • OCR Error Handling
      Error correction:
      Word level techniques:
      Dictionary lookup (Jurafsky & Martin, 2000)
      Character level model uses confusion matrix
      Typically font dependent
      Character n-gram model:
      Some character sequences are more common than others
      Presence of a rare character sequence indicates position of error
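      $\arg\max_{\mathrm{Word}_{Org}} P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{OCR}) = \arg\max_{\mathrm{Word}_{Org}} P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org}) \, P(\mathrm{Word}_{Org})$
      $P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org})$: char level model
      $P(\mathrm{Word}_{Org})$: word level model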
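
The noisy-channel formula above can be sketched in a few lines. Everything numeric here is a made-up assumption for illustration: the toy lexicon, the confusion probabilities, and the constants `P_SAME` and `P_OTHER`. The sketch also assumes substitution-only errors (candidate and OCR word have the same length), which a real character-level model would not require.

```python
import math

# Hypothetical toy data, not taken from the talk.
WORD_COUNTS = {"cement": 50, "dement": 1, "cemented": 5}
# (true_char, ocr_char) -> P(ocr_char | true_char)
CONFUSION = {("m", "n"): 0.05, ("c", "d"): 0.001}
P_SAME = 0.9     # assumed probability a character is read correctly
P_OTHER = 1e-4   # assumed floor for confusions not listed above


def log_channel(ocr: str, org: str) -> float:
    """Char level model: log P(ocr | org), substitutions only."""
    if len(ocr) != len(org):
        return float("-inf")
    total = 0.0
    for o, c in zip(ocr, org):
        p = P_SAME if o == c else CONFUSION.get((c, o), P_OTHER)
        total += math.log(p)
    return total


def log_prior(org: str) -> float:
    """Word level model: log P(org) from relative frequency."""
    return math.log(WORD_COUNTS[org] / sum(WORD_COUNTS.values()))


def correct(ocr: str) -> str:
    """Return the lexicon word maximizing P(ocr | org) * P(org)."""
    return max(WORD_COUNTS, key=lambda w: log_channel(ocr, w) + log_prior(w))


print(correct("cenent"))
```

Working in log space avoids underflow when multiplying many small probabilities. If no lexicon word has the same length as the input, every score is negative infinity and the result is arbitrary, so a real system would need insertions and deletions in the channel model.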
    • OCR Error Handling
      Error correction:
      Passage level/context sensitive techniques:
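      Using language modeling (bi- or trigram LM):
      $P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{OCR}) \propto P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{Org}) \, P(\mathrm{Word}_{Org})$
      with the word-level term $P(\mathrm{Word}_{Org})$ conditioned on the preceding word: $P(\mathrm{Word}_{Org} \mid \mathrm{Word}_{Org-1})$
      Clustering words in passage:
      Assumes salient terms appear more than once
      Ex. Kennedy; Kemedy; Kennody; etc.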
    • OCR Error Handling
      Multi-source fusion:
      Uses language modeling to fuse the output of multiple OCR systems
      Query garbling:
      Use a character level model to generate multiple degraded versions of a query
      Ex.: cement => cement, cornent, cernont, etc.
      Set degraded versions of a term as synonyms
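
Query garbling can be illustrated with a small generator of degraded variants. This is a sketch under stated assumptions: the confusion table below is invented to reproduce the "cement" example, and `garble` and `max_edits` are hypothetical names. A real system would derive the confusions and their probabilities from a trained character-level model.

```python
from itertools import product

# Hypothetical confusions: true character -> possible OCR renderings.
CONFUSIONS = {"m": ["rn"], "e": ["o"]}


def garble(term: str, max_edits: int = 2) -> set[str]:
    """Return the term plus variants with at most max_edits confusions applied."""
    # For each character: keep it, or replace it with a listed rendering.
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in term]
    variants = set()
    for combo in product(*options):
        edits = sum(1 for ch, out in zip(term, combo) if ch != out)
        if edits <= max_edits:
            variants.add("".join(combo))
    return variants


print(sorted(garble("cement")))
```

The resulting set can then be treated as a synonym group for the query term, for example by OR-ing the variants together. The number of variants grows quickly with term length, which is why the sketch caps the number of edits.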
    • Arabic OCR Text Retrieval
      Without error handling → use character n-grams (3- and 4-grams)
      وتام سوق الجنة والنار وبها تقاكظالخليقة إلى المؤمنين رالكفاروالأبا إر رالفجار فهى منشأ الخلق والأمر والثواب والعقاب ،وهىاهدن الذى خطقت له الخليقة رغها رعن حقرقها السمؤال والحساب
      OCR output: رالفجار   Correct word: والفجار
      3-grams of the OCR output: رال ، الف ، فجا ، جار
      3-grams of the correct word: وال ، الف ، فجا ، جار
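
The n-gram example above can be reproduced with a short sketch. The helper names `char_ngrams` and `ngram_overlap` are made up here, and Jaccard overlap is just one convenient way to show how much of a word survives a single OCR error; it is not claimed to be the ranking function used in the talk.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """All overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


ocr_word = "رالفجار"   # OCR output with the first letter misrecognized
true_word = "والفجار"  # intended word

print(char_ngrams(ocr_word, 3))
print(char_ngrams(true_word, 3))
print(ngram_overlap(ocr_word, true_word))
```

A whole-word match between these two strings fails outright, whereas most of their 3-grams are shared, so an index built on character n-grams can still match the query to the degraded text.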
    • Presenting Results
      Presenting OCR output to users is not an option
      What would a ranked list of images look like?
      How would we generate image snippets?
      How do we highlight salient terms in these images?
    • Presenting Results
      What is the unit of search?
      Is it a book, a chapter, or a page?
    • Concluding Remarks
      Scanning is a fairly mature technology
      Arabic OCR still has a long way to go
      Quality of search is tied to the quality of OCR
      Presentation issues persist