Arabic spell checking approaches

Published in: Technology, Education

Transcript

  • 1. Arabic Spell Checking Approaches
      • By: Banan AlHadlaq, Dalal AlZeer, Monirah AlOrf
      • Supervised by: Dr. Amal Al-Saif
      • Natural Language Processing - CS465
  • 2. Outline
      • Introduction
      • Common Arabic Spell Errors
      • Towards Automatic Spell Checking for Arabic
      • Towards Arabic Spell-Checker Based on N-Grams Scores
      • Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
      • Arabic Word Generation and Modeling for Spell Checking
      • Improved Spelling Error Detection and Correction for Arabic
      • Conclusion
  • 3. Introduction
      • Arabic language
      • NLP applications
      • Approaches for solving the Arabic spell checking problem
  • 4. Common Arabic Spell Errors
      • Reading errors [Arabic examples not preserved in the transcript]
      • Hearing errors [Arabic examples not preserved in the transcript]
      • Touch-typing errors
      • Morphological errors
      • Editing errors
  • 5. 1. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
      • A stochastic approach to correcting misspellings in Arabic text.
      • A context-based, two-layer system that automatically corrects misspelled words in large datasets.
  • 6. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)
      • Components: error detection, candidates' generation component, best candidate selection component
      • Error types: single-word errors, space deletion errors, space insertion errors
  • 7. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)
      • Candidates' generation component: space deletion errors
  • 8. Result
      • A standard Arabic text corpus (TRN_DB_I)
      • An extra standard Arabic text corpus (TRN_DB_II)
      • The test data (TST_DB)
      • The test results show that performance improves as the size of the training set increases, reaching an F1 score of 97.9% for detection and 92.3% for correction.
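The space-related candidate generation in this first approach can be illustrated with a small sketch. It is not the authors' implementation, since the slides give no algorithmic detail. It assumes a plain set of valid words (called `lexicon` here), and the names `split_candidates` and `merge_candidates` are hypothetical. A space deletion error (two words typed as one) is handled by trying to split the token into known words; a space insertion error (one word typed as two) is handled by trying to merge adjacent tokens.

```python
def split_candidates(token, lexicon, max_parts=3):
    """Space deletion errors: list every way to split `token`
    into 2..max_parts words that all appear in `lexicon`."""
    results = []

    def extend(start, parts):
        if start == len(token):
            if len(parts) >= 2:
                results.append(" ".join(parts))
            return
        if len(parts) == max_parts:
            return  # too many pieces already, abandon this path
        for end in range(start + 1, len(token) + 1):
            piece = token[start:end]
            if piece in lexicon:
                extend(end, parts + [piece])

    extend(0, [])
    return results


def merge_candidates(tokens, lexicon):
    """Space insertion errors: return (index, merged_word) for every
    pair of adjacent tokens whose concatenation is in `lexicon`."""
    merged = []
    for i in range(len(tokens) - 1):
        joined = tokens[i] + tokens[i + 1]
        if joined in lexicon:
            merged.append((i, joined))
    return merged


if __name__ == "__main__":
    lexicon = {"في", "البيت", "مدرسة"}  # toy lexicon, assumed for illustration
    print(split_candidates("في" + "البيت", lexicon))
    print(merge_candidates(["مدر", "سة"], lexicon))
```

In a complete system, a separate selection step (the "best candidate selection component" on the slides) would choose among these candidates using context; that step is not shown here.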
  • 9. 2. Towards Automatic Spell Checking for Arabic
      • Develops an Arabic spelling checker program.
      • Implemented in the SICStus Prolog language.
      • Recognizes common Arabic spelling errors and offers suggestions for correcting them.
      • Recognizes common spelling errors in Standard Arabic and Egyptian dialects.
      • Can be integrated with other text processing software, such as word processors.
  • 10. Towards Automatic Spell Checking for Arabic (Cont.)
      • Analyzes the common spelling errors, which are then used to detect misspelled Arabic words.
      • Detection is limited to spelling errors in isolated words (non-words) [Arabic example not preserved in the transcript].
      • Performs a series of heuristic steps to find a replacement candidate:
          • Add a missing character
          • Replace an incorrect character
          • Remove an excessive character
          • Add a space to split words
  • 11. Towards Automatic Spell Checking for Arabic (Cont.)
      • Add missing character: example misspelled word and its candidates [Arabic text not preserved in the transcript]
      • Replace incorrect character: example misspelled word and its candidates [Arabic text not preserved in the transcript]
      • Remove excessive character: example misspelled word and its candidates [Arabic text not preserved in the transcript]
      • Add a space to split words: example misspelled word and its candidates [Arabic text not preserved in the transcript]
  • 12. Towards Automatic Spell Checking for Arabic (Cont.)
      • Neighbors table
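The four heuristic steps of this second approach can be sketched as follows. The original system was written in SICStus Prolog; this Python version is only an illustration under stated assumptions: a set of valid words named `lexicon`, the 28 basic Arabic letters as the alphabet, and a hypothetical function name `heuristic_candidates`. The slides also mention a neighbors table, whose contents are not preserved, so this sketch tries every letter as a replacement rather than only neighboring ones.

```python
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # the 28 basic letters


def heuristic_candidates(word, lexicon, alphabet=ARABIC_LETTERS):
    """Apply the four heuristics and keep only results found in `lexicon`."""
    found = {"add": set(), "replace": set(), "remove": set(), "split": set()}

    # 1. Add a missing character at every position.
    for i in range(len(word) + 1):
        for ch in alphabet:
            candidate = word[:i] + ch + word[i:]
            if candidate in lexicon:
                found["add"].add(candidate)

    for i in range(len(word)):
        # 2. Replace the character at position i.
        for ch in alphabet:
            if ch != word[i]:
                candidate = word[:i] + ch + word[i + 1:]
                if candidate in lexicon:
                    found["replace"].add(candidate)
        # 3. Remove the (possibly excessive) character at position i.
        candidate = word[:i] + word[i + 1:]
        if candidate in lexicon:
            found["remove"].add(candidate)

    # 4. Add a space to split the string into two known words.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            found["split"].add(left + " " + right)

    return found


if __name__ == "__main__":
    lexicon = {"كتاب", "باب", "في"}  # toy lexicon, assumed for illustration
    print(heuristic_candidates("كتب", lexicon))
```

Restricting step 2 to entries of a neighbors table would shrink the candidate set; that refinement is left out because the table itself is not available in the transcript.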
  • 13. 3. Towards Arabic Spell-Checker Based on N-Grams Scores
      • Develops a simple and flexible spell checker for the Arabic language (error detection).
      • Based on N-gram scores.
      • Uses a matrix approach.
      • The corpus is adapted from Muaidi's PhD thesis.
      • It consists of 101,987 word types.
  • 14. Towards Arabic Spell-Checker Based on N-Grams Scores (Cont.)
      • Enter the tested text → tokenizing process → cleaning process → matrix method deals with each word
  • 15. Matrix Method Deals With Each Word
      • Building the matrices:
          • Number of matrices = length of the longest word in the corpus − 1.
          • The dimension of each matrix is 28 × 28 (for the 28 Arabic letters).
          • M1 holds the combinations of the first and second letters of a word, M2 the combinations of the second and third letters, and so on.
          • All matrices are initialized with zeros.
  • 16. Matrix Method Deals With Each Word
      • 2-Gram set (S):
          • Each item in S consists of two letters.
          • Each item is assigned the value 1 or 2 in the corresponding matrix.
          • It is assigned 2 if the word ends with these two letters.
          • It is assigned 1 if the two letters are connected and the word is not over yet.
      • Example [Arabic letters not preserved in the transcript]: for a four-letter word, the 2-Gram set S has three items, giving M1[ ][ ] = 1, M2[ ][ ] = 1, M3[ ][ ] = 2.
  • 17. Matrix Method Deals With Each Word (Cont.)
      • Enter the tested text → tokenizing process → cleaning process → matrix method deals with each word
  • 18. Result
      • The training dataset consists of 71,390 Arabic words (70%), while the testing dataset consists of 30,597 Arabic words (30%).
      • The overall evaluation of the results: increasing the size of the dataset increases the accuracy.
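The matrix method described on the slides above can be sketched in a few lines. This is a reading of the slides rather than the authors' code, and it makes two labeled assumptions: words containing characters outside the 28 basic letters are skipped (standing in for the cleaning process), and when the same letter pair at the same position both ends one word and continues another, the cell keeps the larger value 2. The slides do not say how that conflict is resolved. The names `build_matrices` and `looks_valid` are hypothetical.

```python
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # the 28 basic letters
INDEX = {ch: i for i, ch in enumerate(ARABIC_LETTERS)}


def build_matrices(corpus_words):
    """Build M1..M(L-1), each 28 x 28, from the corpus word list."""
    words = [w for w in corpus_words
             if len(w) >= 2 and all(ch in INDEX for ch in w)]
    longest = max((len(w) for w in words), default=1)
    # Number of matrices = length of the longest word - 1, all zeros.
    matrices = [[[0] * 28 for _ in range(28)] for _ in range(longest - 1)]
    for w in words:
        for k in range(len(w) - 1):
            value = 2 if k == len(w) - 2 else 1  # 2 marks a word-final pair
            row, col = INDEX[w[k]], INDEX[w[k + 1]]
            # Assumption: keep the larger value on conflicts.
            matrices[k][row][col] = max(matrices[k][row][col], value)
    return matrices


def looks_valid(word, matrices):
    """Accept a word if every letter pair is known at its position
    and the last pair is marked as word-final."""
    if len(word) < 2 or len(word) - 1 > len(matrices):
        return False
    if any(ch not in INDEX for ch in word):
        return False
    for k in range(len(word) - 1):
        cell = matrices[k][INDEX[word[k]]][INDEX[word[k + 1]]]
        needed = 2 if k == len(word) - 2 else 1
        if cell < needed:
            return False
    return True


if __name__ == "__main__":
    matrices = build_matrices(["كتاب", "باب"])  # toy corpus, assumed
    print(looks_valid("كتاب", matrices), looks_valid("كتب", matrices))
```

Because each matrix only records adjacent pairs, the check can accept some strings that are not in the corpus; it is a detection heuristic, not an exact dictionary lookup.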
  • 19. 4. Arabic Word Generation and Modeling for Spell Checking
      • Bridges the critical gap in available open-source spell checking resources for Arabic.
      • Creates an open-source, large-coverage word list for Arabic (9,000,000 words).
      • Error detection:
          • Direct method: match the words of an open text against a list of correct words.
          • Language modeling method: build a character-based tri-gram language model using SRILM in order to classify generated words as valid or invalid.
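To make the language modeling method more concrete, here is a minimal character tri-gram model in pure Python. It does not use SRILM and is not the authors' setup: the add-one smoothing, the boundary markers `^` and `$` (assumed not to occur inside words), and the class name `CharTrigramModel` are all choices made for this illustration. A word with a low average log-probability would be flagged as likely invalid; the threshold would have to be tuned on real data.

```python
import math
from collections import Counter


class CharTrigramModel:
    """Add-one smoothed character tri-gram model over single words."""

    def __init__(self, words):
        self.trigrams = Counter()
        self.contexts = Counter()  # counts of the two-character histories
        self.symbols = set()
        for word in words:
            padded = "^^" + word + "$"
            self.symbols.update(padded)
            for i in range(len(padded) - 2):
                self.contexts[padded[i:i + 2]] += 1
                self.trigrams[padded[i:i + 3]] += 1

    def avg_logprob(self, word):
        """Average log-probability per predicted character."""
        padded = "^^" + word + "$"
        size = len(self.symbols) + 1  # +1 leaves room for unseen symbols
        steps = len(padded) - 2
        total = 0.0
        for i in range(steps):
            context, trigram = padded[i:i + 2], padded[i:i + 3]
            total += math.log(
                (self.trigrams[trigram] + 1) / (self.contexts[context] + size)
            )
        return total / steps


if __name__ == "__main__":
    model = CharTrigramModel(["كتاب", "كتب", "كاتب", "مكتب"])  # toy word list
    print(model.avg_logprob("كتاب"), model.avg_logprob("ظظظظ"))
```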
  • 20. Flow chart of spelling error correction
      • Elements shown: input; finite-state transducer; Arabic word list; error decision (yes/no); candidates list; candidates ranker (augmented edit distance and language-specific rules); noisy channel model; Gigaword corpus; score; post-processing; suggestion list; display suggestions.
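The candidates ranker in the flow chart relies on an augmented edit distance, whose exact definition is not given in the transcript. As a baseline for intuition only, the sketch below uses the plain Levenshtein distance and breaks ties with a word frequency table. The `frequency` dictionary and the function name `rank_candidates` are assumptions, and the language-specific rules and noisy channel model from the flow chart are not modeled.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def rank_candidates(error, candidates, frequency):
    """Sort by edit distance to the error, then by descending frequency."""
    return sorted(
        candidates,
        key=lambda cand: (levenshtein(error, cand), -frequency.get(cand, 0)),
    )


if __name__ == "__main__":
    frequency = {"كتاب": 50, "كاتب": 10}  # made-up counts for illustration
    print(rank_candidates("كتب", ["باب", "كاتب", "كتاب"], frequency))
```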
  • 21. Result
      • Best accuracy score = 75%
      • Evaluation on:
          • Microsoft Word 2010 = 80.54%
          • Hunspell using Ayaspell = 45.64%
  • 22. 5. Improved Spelling Error Detection and Correction for Arabic
      • Spelling error detection and correction components: dictionary (or reference word list), error model, language model
  • 23. Improving the Dictionary
      • AraComLex Extended word list
      • Matching its word list against the Gigaword corpus
      • Double-checked by the Buckwalter Arabic Morphological Analyzer
      • Creating a dictionary of 9.3 million Arabic words
  • 24. Improving the Error Model: Candidate Generation
      • A finite-state transducer proposes candidate corrections → candidates that are not found in the word list are discarded → the remaining candidates are ranked
      • Spelling correction: N-gram language models.
      • The candidate with the lowest perplexity score is selected as the correction.
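Selecting the candidate with the lowest perplexity can be illustrated with a toy word bi-gram model. The slides only say that N-gram language models are used, so everything specific here is an assumption: the bi-gram order, the add-one smoothing, the three training sentences, and the function names `train_bigram`, `perplexity`, and `best_correction`.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bi-grams with sentence boundary markers <s> and </s>."""
    histories, bigrams = Counter(), Counter()
    vocabulary = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens[1:])
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams, len(vocabulary) + 1  # +1 for unseen words


def perplexity(sentence, model):
    """Perplexity of a sentence under the add-one smoothed bi-gram model."""
    histories, bigrams, size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log((bigrams[(prev, word)] + 1) / (histories[prev] + size))
        for prev, word in zip(tokens, tokens[1:])
    )
    return math.exp(-log_prob / (len(tokens) - 1))


def best_correction(left, candidates, right, model):
    """Pick the candidate whose sentence has the lowest perplexity."""
    return min(
        candidates,
        key=lambda cand: perplexity(" ".join(left + [cand] + right), model),
    )


if __name__ == "__main__":
    model = train_bigram([
        "قرأت كتاب جديد",
        "اشتريت كتاب جديد",
        "رأيت كاتب مشهور",
    ])
    print(best_correction(["قرأت"], ["كاتب", "كتاب"], ["جديد"], model))
```

The same selection rule carries over to higher-order models trained on a large corpus; only the probability estimate inside `perplexity` would change.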
  • 25. Improving the Language Model: Analyzing the Training Data
      • Analyze the level of noise in different sources of data.
      • Agence France-Presse (AFP) is the noisiest, while the Al-Jazeera data is the cleanest.
      • Select the optimal subset to train the system on.
  • 26. Result
      • AFP = 73.93%
      • Al-Jazeera data = 80.97%
      • Gigaword corpus = 82.86%
      • Clean data is better than noisy data when the two are comparable in size; however, more data is better than clean data.
      • Evaluation on:
          • Google Docs = 9.32%
          • Ayaspell for OpenOffice = 41.86%
          • Microsoft Word 2010 = 57.15%
  • 27. Conclusion
      • Having reviewed these approaches, we see that the results are promising and represent a good starting point for future research on enhancing Arabic spell checkers.
  • 28. THANKS 
