Arabic spell checking approaches


Published in: Technology, Education

  1. Arabic Spell Checking Approaches
     By: Banan AlHadlaq, Dalal AlZeer, Monirah AlOrf
     Supervised by: Dr. Amal Al-Saif
     Natural Language Processing - CS465
  2. Outline
     • Introduction
     • Common Arabic Spell Errors
     • Towards Automatic Spell Checking for Arabic
     • Towards Arabic Spell-Checker Based on N-Grams Scores
     • Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
     • Arabic Word Generation and Modeling for Spell Checking
     • Improved Spelling Error Detection and Correction for Arabic
     • Conclusion
  3. Introduction
     • Arabic language
     • NLP applications
     • Approaches for solving the Arabic spell checking problem
  4. Common Arabic Spell Errors
     • Reading errors [Arabic examples not preserved]
     • Hearing errors [Arabic examples not preserved]
     • Touch-typing errors
     • Morphological errors
     • Editing errors
  5. Approach 1: Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
     • A stochastic approach to correcting misspellings in Arabic text.
     • A context-based, two-layer system that automatically corrects misspelled words in large datasets.
  6. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)
     • Components: error detection, candidate generation, best-candidate selection
     • Error types: single-word errors, space deletion errors, space insertion errors
  7. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)
     Candidate generation component: space deletion errors
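A space deletion error fuses two or more words into one token, so candidate generation for this error type amounts to finding places where the token can be split back into known words. The sketch below is only an illustration in Python, not the authors' implementation: the function name `split_candidates`, the `max_parts` limit, and the three-word lexicon are all assumptions made for the example.

```python
def split_candidates(token, lexicon, max_parts=3):
    """Return every way to split `token` into 2..max_parts lexicon words.

    Hypothetical helper: the slides do not give the real algorithm.
    """
    results = []

    def extend(rest, parts):
        if not rest:
            # A valid split needs at least two words.
            if len(parts) >= 2:
                results.append(parts)
            return
        if len(parts) == max_parts:
            return
        for i in range(1, len(rest) + 1):
            head = rest[:i]
            if head in lexicon:
                extend(rest[i:], parts + [head])

    extend(token, [])
    return results


# Toy lexicon (assumed): "in", "the house", "from".
lexicon = {"في", "البيت", "من"}
merged = "في" + "البيت"  # two words typed without the space
print(split_candidates(merged, lexicon))
```

Each returned split is only a candidate; in the approach described above, a separate component selects the best candidate from context.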
  8. Result
     • A standard Arabic text corpus (TRN_DB_I)
     • An additional standard Arabic text corpus (TRN_DB_II)
     • The test data (TST_DB)
     • Testing shows that performance improves as the training set grows, reaching an F1 score of 97.9% for detection and 92.3% for correction.
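The F1 score reported for detection and correction is the harmonic mean of precision and recall. A minimal sketch of the computation follows; the counts in the usage line are invented for illustration and are not taken from the study.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Invented counts: 90 errors correctly flagged, 10 false alarms, 10 misses.
print(f1_score(90, 10, 10))
```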
  9. Approach 2: Towards Automatic Spell Checking for Arabic
     • Develops an Arabic spelling checker program.
     • Implemented in SICStus Prolog.
     • Recognizes common Arabic spelling errors and offers suggestions for correction.
     • Recognizes common spelling errors in both Standard Arabic and Egyptian dialects.
     • Can be integrated with other text processing software, such as word processors.
  10. Towards Automatic Spell Checking for Arabic (Cont.)
     • Common spelling errors are analyzed and used to detect misspelled Arabic words.
     • Detection is limited to isolated-word (non-word) errors [Arabic example not preserved].
     • A series of heuristic steps finds replacement candidates: add a missing character, replace an incorrect character, remove an excess character, add a space to split words.
  11. Towards Automatic Spell Checking for Arabic (Cont.)
     • Add a missing character: candidates for a misspelled word [Arabic example not preserved]
     • Replace an incorrect character: candidates for a misspelled word [Arabic example not preserved]
     • Remove an excess character: candidates for a misspelled word [Arabic example not preserved]
     • Add a space to split words: candidates for a misspelled word [Arabic example not preserved]
  12. Towards Automatic Spell Checking for Arabic (Cont.)
     Neighbors table [table not preserved]
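The four heuristic steps (add, replace, remove, split) can be sketched as a candidate generator that keeps only results found in a word list. The original system is written in SICStus Prolog; the Python below is an illustrative sketch, not a port. The function name `candidates`, the toy lexicon, and the choice to try all 28 letters on replacement are assumptions. The slides mention a neighbors table, whose contents are not available here, and which presumably narrows the letters tried.

```python
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # the 28 basic letters


def candidates(word, lexicon, letters=ARABIC_LETTERS):
    """Generate correction candidates with the four heuristic steps.

    Hypothetical sketch: every letter is tried at every position.
    """
    found = {"add": set(), "replace": set(), "remove": set(), "split": set()}

    # 1. Add a missing character at any position.
    for i in range(len(word) + 1):
        for ch in letters:
            cand = word[:i] + ch + word[i:]
            if cand in lexicon:
                found["add"].add(cand)

    for i in range(len(word)):
        # 2. Replace an incorrect character.
        for ch in letters:
            cand = word[:i] + ch + word[i + 1:]
            if ch != word[i] and cand in lexicon:
                found["replace"].add(cand)
        # 3. Remove an excess character.
        cand = word[:i] + word[i + 1:]
        if cand in lexicon:
            found["remove"].add(cand)

    # 4. Add a space to split two fused words.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            found["split"].add(left + " " + right)

    return found


# Toy lexicon (assumed): book, wrote, dog, in, the house.
lexicon = {"كتاب", "كتب", "كلب", "في", "البيت"}
print(candidates("كت" + "تاب", lexicon)["remove"])
```

Restricting the replacement step to a small set of likely confusions per letter would cut the number of lexicon lookups considerably compared with trying the full alphabet.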
  13. Approach 3: Towards Arabic Spell-Checker Based on N-Grams Scores
     • Develops a simple and flexible spell checker for Arabic (error detection).
     • Based on N-gram scores.
     • Uses a matrix approach.
     • The corpus is adapted from Muaidi's PhD thesis.
     • It consists of 101,987 word types.
  14. Towards Arabic Spell-Checker Based on N-Grams Scores (Cont.)
     Enter the text to be tested → tokenizing process → cleaning process → matrix method deals with each word
  15. Matrix Method Deals With Each Word
     • Building the matrices
     • Number of matrices = length of the longest word in the corpus - 1.
     • Each matrix is 28 × 28 (one row and one column per Arabic letter).
     • M1 holds the combinations of the first and second letters in a word, M2 holds the combinations of the second and third letters, and so on.
     • All matrices are initialized to zeros.
  16. Matrix Method Deals With Each Word
     • 2-Gram set (S)
     • Each item in S consists of two letters.
     • Each item is assigned the value 1 or 2 in the corresponding matrix.
     • It is assigned 2 if the word ends with these two letters.
     • It is assigned 1 if the two letters are connected and the word has not ended yet.
     • e.g. for a four-letter word, S holds its three letter pairs, and M1[ ][ ] = 1, M2[ ][ ] = 1, M3[ ][ ] = 2 [Arabic letters not preserved].
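The matrix method above can be sketched in a few lines of Python. This is a reading of the slides, not the authors' code, and it rests on three assumptions: when one word ends on a letter pair and another word continues past the same pair at the same position, the cell keeps the larger value 2; a word is accepted when every non-final pair has a nonzero cell and the final pair has the value 2; and words containing characters outside the 28 basic letters are skipped, on the guess that the cleaning step normalizes them.

```python
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # 28 letters
INDEX = {ch: i for i, ch in enumerate(ARABIC_LETTERS)}


def build_matrices(corpus_words):
    """Build M1..M(L-1), where L is the length of the longest corpus word."""
    words = [w for w in corpus_words
             if len(w) >= 2 and all(c in INDEX for c in w)]
    longest = max((len(w) for w in words), default=1)
    size = len(ARABIC_LETTERS)
    # One 28 x 28 matrix per letter position, initialized to zeros.
    matrices = [[[0] * size for _ in range(size)] for _ in range(longest - 1)]
    for w in words:
        for k in range(len(w) - 1):
            row, col = INDEX[w[k]], INDEX[w[k + 1]]
            value = 2 if k == len(w) - 2 else 1  # 2 marks a word-final pair
            # Assumption: keep the larger value on conflicts.
            matrices[k][row][col] = max(matrices[k][row][col], value)
    return matrices


def is_accepted(word, matrices):
    """Check each adjacent letter pair of `word` against its matrix."""
    if len(word) < 2 or len(word) - 1 > len(matrices):
        return False
    if any(c not in INDEX for c in word):
        return False
    for k in range(len(word) - 1):
        cell = matrices[k][INDEX[word[k]]][INDEX[word[k + 1]]]
        needed = 2 if k == len(word) - 2 else 1
        if cell < needed:
            return False
    return True


matrices = build_matrices(["كتاب", "كتب"])
print(is_accepted("كتاب", matrices), is_accepted("كبت", matrices))
```

Because only adjacent pairs are checked, a string whose every pair is attested at the right position is accepted even if the whole string never occurred in the corpus, so a check of this kind can pass some non-words.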
  17. Matrix Method Deals With Each Word (Cont.)
     Enter the text to be tested → tokenizing process → cleaning process → matrix method deals with each word
  18. Result
     • The training dataset consists of 71,390 Arabic words (70%), while the testing dataset consists of 30,597 Arabic words (30%).
     The Overall Evaluation of the Results
     • Increasing the size of the dataset increases the accuracy.
  19. Approach 4: Arabic Word Generation and Modeling for Spell Checking
     • Bridges the critical gap in available open-source spell checking resources for Arabic.
     • Creates an open-source, large-coverage word list for Arabic (9,000,000 words).
     • Error detection:
     • Direct method: match the words of an open text against a list of correct words.
     • Language modeling method: build a character-based tri-gram language model using SRILM to classify generated words as valid or invalid.
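The two detection methods can be illustrated with a short Python sketch. It does not use SRILM: the character tri-gram model here is a hand-rolled stand-in with add-one smoothing, and the names `direct_errors`, `CharTrigramModel`, and `looks_valid`, the five training words, and the threshold value are all assumptions for the example.

```python
import math
from collections import Counter


def direct_errors(tokens, word_list):
    """Direct method: flag every token that is absent from the word list."""
    return [t for t in tokens if t not in word_list]


class CharTrigramModel:
    """Toy character tri-gram model with add-one smoothing (not SRILM)."""

    def __init__(self, words):
        self.trigrams = Counter()
        self.contexts = Counter()
        self.alphabet = set()
        for w in words:
            padded = "^^" + w + "$"  # start and end markers
            self.alphabet.update(padded)
            for i in range(len(padded) - 2):
                self.contexts[padded[i:i + 2]] += 1
                self.trigrams[padded[i:i + 3]] += 1

    def avg_log_prob(self, word):
        """Average log-probability per character, given two preceding characters."""
        padded = "^^" + word + "$"
        vocab = len(self.alphabet)
        steps = len(padded) - 2
        total = 0.0
        for i in range(steps):
            tri, ctx = padded[i:i + 3], padded[i:i + 2]
            total += math.log((self.trigrams[tri] + 1) / (self.contexts[ctx] + vocab))
        return total / steps


def looks_valid(model, word, threshold):
    """Classify a word as valid if its score clears a hand-chosen threshold."""
    return model.avg_log_prob(word) >= threshold


model = CharTrigramModel(["كتاب", "كتب", "كاتب", "مكتب", "مكتبة"])
print(model.avg_log_prob("كتاب") > model.avg_log_prob("بتكا"))
```

The direct method is exact but limited by word list coverage, while the tri-gram score can judge strings that are not in any list, which is why it suits classifying generated words.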
  20. Flow chart of spelling error correction.
     Elements shown: input; finite-state transducer; "Error?" decision (yes/no); suggestion list; candidates list; score; candidates ranker (augmented edit distance and language-specific rules); post-processing; display suggestions; Arabic word list; noisy channel model; Gigaword corpus.
  21. Result
     • Best accuracy score = 75%
     • Evaluation on: Microsoft Word 2010 = 80.54%; Hunspell using Ayaspell = 45.64%
  22. Approach 5: Improved Spelling Error Detection and Correction for Arabic
     Spelling error detection and correction components: dictionary (or reference word list), error model, language model
  23. Improving the Dictionary
     AraComLex Extended word list
     • Matching its word list against the Gigaword corpus
     • Double-checked by the Buckwalter Arabic Morphological Analyzer
     • Creating a dictionary of 9.3 million Arabic words
  24. Improving the Error Model: Candidate Generation
     • A finite-state transducer proposes candidate corrections.
     • Candidates that are not found in the word list are discarded.
     • The remaining candidates are ranked.
     • Spelling correction: N-gram language models. The candidate with the lowest perplexity score is selected as the correction.
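The filter-then-rank step above can be sketched as follows: drop candidates that are missing from the word list, substitute each remaining candidate into the sentence, and keep the one that gives the lowest perplexity. The sketch assumes a word bi-gram model with add-one smoothing trained on three invented sentences; the slides do not specify the N-gram order or smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count bi-grams and their left contexts over tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def perplexity(tokens, contexts, bigrams, vocab_size):
    """Perplexity of one sentence under an add-one smoothed bi-gram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_sum = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_sum += math.log(p)
    return math.exp(-log_sum / (len(padded) - 1))


def best_candidate(tokens, position, candidates, word_list, contexts, bigrams):
    """Discard out-of-list candidates, then pick the lowest-perplexity one."""
    kept = [c for c in candidates if c in word_list]
    if not kept:
        return None
    vocab_size = len(word_list) + 1  # +1 for the end-of-sentence marker

    def score(cand):
        trial = tokens[:position] + [cand] + tokens[position + 1:]
        return perplexity(trial, contexts, bigrams, vocab_size)

    return min(kept, key=score)


# Invented training sentences and word list.
training = [["قرأ", "الولد", "الكتاب"],
            ["قرأ", "الرجل", "الكتاب"],
            ["فتح", "الرجل", "الباب"]]
word_list = {"قرأ", "فتح", "الولد", "الرجل", "الكتاب", "الباب"}
contexts, bigrams = train_bigram_counts(training)
sentence = ["قرأ", "الولد", "الكتبا"]  # last word is misspelled
print(best_candidate(sentence, 2, ["الكتاب", "الباب", "الكتبة"],
                     word_list, contexts, bigrams))
```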
  25. Improving the Language Model: Analyzing the Training Data
     • Analyze the level of noise in different sources of data.
     • Agence France-Presse (AFP) is the noisiest, while Al-Jazeera data is the cleanest.
     • Select the optimal subset to train the system on.
  26. Result
     • AFP = 73.93%
     • Al-Jazeera data = 80.97%
     • Gigaword corpus = 82.86%
     • Clean data is better than noisy data when they are comparable in size; however, more data is better than clean data.
     • Evaluation on: Google Docs = 9.32%; Ayaspell for OpenOffice = 41.86%; Microsoft Word 2010 = 57.15%
  27. Conclusion
     • The approaches reviewed show promising results and are a good starting point for future research on improving Arabic spell checkers.
  28. THANKS