
Intelligent Text Document Correction System Based on Similarity Technique



  1. 1. Intelligent Text Document Correction System Based on Similarity Technique A Thesis Submitted to the Council of the College of Information Technology, University of Babylon, in Partial Fulfillment of the Requirements for the Degree of Master of Sciences in Computer Sciences By Marwa Kadhim Obeid Al-Rikaby Supervised by Prof. Dr. Abbas Mohsen Al-Bakry 2015 A.D. 1436 A.H. Ministry of Higher Education and Scientific Research University of Babylon - College of Information Technology Software Department
  2. 2. In the name of Allah, the Most Gracious, the Most Merciful. "By it Allah guides those who pursue His pleasure to the ways of peace, and brings them out from darkness into the light by His permission, and guides them to a straight path." Allah Almighty has spoken the truth. (Surat Al-Ma'idah, verse 16)
  3. 3. Supervisor Certification I certify that this thesis was prepared under my supervision at the Department of Software / Information Technology / University of Babylon, by Marwa Kadhim Obeid Al-Rikaby, in partial fulfillment of the requirements for the degree of Master of Sciences in Computer Science. Signature: Supervisor: Prof. Dr. Abbas Mohsen Al-Bakry Title: Professor Date: / / 2015 The Head of the Department Certification In view of the available recommendation, we forward this thesis for debate by the examining committee. Signature: Name: Dr. Eman Salih Al-Shamery Title: Assist. Professor Date: / / 2015
  4. 4. To the Master of creatures, Loved by Allah, The Prophet Muhammad (Allah bless him and his family)
  5. 5. Acknowledgements All praise be to Allah Almighty who enabled me to complete this task successfully, and utmost respect to His last Prophet Mohammad (PBUH). First, my appreciation is due to my advisor, Prof. Dr. Abbas Mohsen Al-Bakry, for his advice and guidance that led to the completion of this thesis. I would like to thank the staff of the Software Department for the help they have offered, especially the head of the Software Department, Dr. Eman Salih Al-Shamery. Most importantly, I would like to thank my parents, my sisters, my brothers and my friends for their support.
  6. 6. Abstract Automatic text correction is one of the challenges of human-computer interaction. It is involved directly in several application areas, such as correcting digitized handwritten text, and indirectly in others, such as correcting users' queries before applying a retrieval process in interactive databases. The automatic text correction process passes through two major phases: error detection and candidate suggestion. Techniques for both phases are categorized as procedural or statistical. Procedural techniques are based on rules that govern the acceptability of texts, including Natural Language Processing techniques. Statistical techniques, on the other hand, depend on statistics and probabilities collected from large corpora reflecting what is commonly used by humans. In this work, natural language processing techniques are used as the basis for analysis and for both spelling and grammar acceptance checking of English texts. A prefix-dependent hash-indexing scheme is used to shorten the time of looking up the underlying dictionary, which contains all English tokens. The dictionary is used as the basis for the error detection process. Candidate generation is based on calculating the similarity of a source token to the dictionary tokens, measured using an improved Levenshtein method, and ranking them accordingly. However, this process is time-intensive; therefore, tokens are divided into smaller groups according to spelling similarity in a way that preserves random access. Finally, candidate suggestion involves examining a set of features related to commonly committed mistakes. The system selects the optimal candidate, the one which provides the highest suitability and does not violate grammar rules, to generate linguistically accepted text. Testing the system's accuracy showed better results than Microsoft Word and some other systems. The enhanced similarity measure reduced the time complexity to the boundaries of the original Levenshtein method while discovering an additional error type.
  7. 7. Table of Contents
Chapter One: Overview. 1.1 Introduction; 1.2 Problem Statement; 1.3 Literature Review; 1.4 Research Objectives; 1.5 Thesis Outlines.
Chapter Two: Background and Related Concepts.
Part I: Natural Language Processing. 2.1 Introduction; 2.2 Natural Language Processing Definition; 2.3 Natural Language Processing Applications (2.3.1 Text Techniques; 2.3.2 Speech Techniques); 2.4 Natural Language Processing and Linguistics (2.4.1 Linguistics; 2.4.1.1 Terms of Linguistic Analysis; 2.4.1.2 Linguistic Units Hierarchy; 2.4.1.3 Sentence Structure and Constituency; 2.4.1.4 Language and Grammar); 2.5 Natural Language Processing Techniques (2.5.1 Morphological Analysis; 2.5.2 Part of Speech Tagging; 2.5.3 Syntactic Analysis; 2.5.4 Semantic Analysis; 2.5.5 Discourse Integration; 2.5.6 Pragmatic Analysis); 2.6 Natural Language Processing Challenges (2.6.1 Linguistic Units Challenges; 2.6.1.1 Tokenization; 2.6.1.2 Segmentation; 2.6.2 Ambiguity; 2.6.2.1 Lexical Ambiguity; 2.6.2.2 Syntactic Ambiguity; 2.6.2.3 Semantic Ambiguity; 2.6.2.4 Anaphoric Ambiguity; 2.6.3 Language Change; 2.6.3.1 Phonological Change; 2.6.3.2 Morphological Change; 2.6.3.3 Syntactic Change; 2.6.3.4 Lexical Change; 2.6.3.5 Semantic Change).
Part II: Text Correction. 2.7 Introduction; 2.8 Text Errors (2.8.1 Non-word Errors; 2.8.2 Real-word Errors); 2.9 Error Detection Techniques (2.9.1 Dictionary Looking Up; 2.9.1.1 Dictionaries Resources; 2.9.1.2 Dictionaries Structures; 2.9.2 N-gram Analysis); 2.10 Error Correction Techniques (2.10.1 Minimum Edit Distance Techniques; 2.10.2 Similarity Key Techniques; 2.10.3 Rule Based Techniques; 2.10.4 Probabilistic Techniques); 2.11 Suggestion of Corrections; 2.12 The Suggested Approach (2.12.1 Finding Candidates Using Minimum Edit Distance; 2.12.2 Candidates Mining; 2.12.3 Part-of-Speech Tagging and Parsing).
Chapter Three: Hashed Dictionary and Looking Up Technique. 3.1 Introduction; 3.2 Hashing (3.2.1 Hash Function; 3.2.2 Formulation; 3.2.3 Indexing); 3.3 Looking Up Procedure; 3.4 Dictionary Structure Properties; 3.5 Similarity Based Looking-Up (3.5.1 Bi-grams Generation; 3.5.2 Primary Centroids Selection; 3.5.3 Centroids Referencing); 3.6 Application of Similarity Based Looking Up Approach; 3.7 The Similarity Based Looking Up Properties.
Chapter Four: Error Detection and Candidates Generation. 4.1 Introduction; 4.2 Non-word Error Detection; 4.3 Real-Words Error Detection; 4.4 Candidates Generation (4.4.1 Candidates Generation for Non-word Errors; 4.4.1.2 Enhanced Levenshtein Method; 4.4.1.3 Similarity Measure; 4.4.1.4 Looking for Candidates; 4.4.2 Candidates Generation for Real-words Errors).
Chapter Five: Text Correction and Candidates Suggestion. 5.1 Introduction; 5.2 Correction and Candidates Suggestion Structure; 5.3 Named-Entity Recognition; 5.4 Candidates Ranking (5.4.1 Edit Distance Based Similarity; 5.4.2 First and End Symbols Matching; 5.4.3 Difference in Lengths; 5.4.4 Transposition Probability; 5.4.5 Confusion Probability; 5.4.6 Consecutive Letters (Duplication); 5.4.7 Different Symbols Existence); 5.5 Syntax Analysis (5.5.1 Sentence Phrasing; 5.5.2 Candidates Optimization; 5.5.3 Grammar Correction; 5.5.4 Document Correction).
Chapter Six: Experimental Results, Conclusions, and Future Works. 6.1 Experimental Results (6.1.1 Tagging and Error Detection Time Reduction; 6.1.1.1 Successful Looking Up; 6.1.1.2 Failure Looking Up; 6.1.2 Candidates Generation and Similarity Search Space Reduction; 6.1.3 Time Reduction of the Damerau-Levenshtein Method; 6.1.4 Features Effect on Candidates Suggestion); 6.2 Conclusions; 6.3 Future Works.
References; Appendix A; Appendix B.
List of Figures
(2.1) NLP dimensions; (2.2) Linguistics analysis steps; (2.3) Linguistic Units Hierarchy; (2.4) Classification of POS tagging models; (2.5) An example of lexical change; (2.6) Outlines of Spell Correction Algorithm; (2.7) Levenshtein Edit Distance Algorithm; (2.8) Damerau-Levenshtein Edit Distance Algorithm; (2.9) The Suggested System Block Diagram; (3.1) Token Hashing Algorithm; (3.2) Dictionary Structure and Indexing Scheme; (3.3) Algorithm of Looking Up Procedure; (3.4) Semi Hash Clustering block diagram; (3.5) Similarity Based Hashing algorithm; (3.6) Block diagram of candidates generation using SBL; (3.7) Similarity Based Looking up algorithm; (4.1) Tagging Flow Chart; (4.2) The Enhanced Levenshtein Method Algorithm; (4.3) Original Levenshtein Example; (4.4) Damerau-Levenshtein Example; (4.5) Enhanced Levenshtein Example; (5.1) Candidates ranking flowchart; (5.2) Syntax analysis flowchart; (6.1) Tokens distribution in primary packets; (6.2) Tokens distribution in secondary packets; (6.3) Time complexity variance of Levenshtein, Damerau-Levenshtein, and Enhanced Levenshtein (our modification); (6.4) Suggestion accuracy with a comparison to Microsoft Office Word on a sample from the Wikipedia; (6.5) Testing the suggested system accuracy and comparing the results with other systems using the same dataset; (6.6) Discarding one feature at a time for optimal candidate selection; (6.7) Using one feature at a time for optimal candidate selection.
  12. 12. List of Tables
(1-1) Summary of Literature Review; (3-1) Alphabet Encoding; (3-2) Addressing Range; (3-3) Predicting errors using Bi-grams analysis; (5-1) Transposition Matrix; (5-2) Confusion Matrix.
List of Symbols and Abbreviations
∑ : Alphabet
A : Adjectival Phrase
abs : Absolute Difference
C : Sentence Complement
CFG : Context Free Grammar
D : Dictionary
DNA : Deoxyribonucleic Acid
E : Error
G : Grammar
GEC : Grammar Error Correction
HMM : Hidden Markov Model
IR : Information Retrieval
MT : Machine Translation
NE : Named Entity
NER : Named-Entity Recognition
NG : Noun Group
NLG : Natural Language Generation
NLP : Natural Language Processing
NLs : Natural Languages
NLU : Natural Language Understanding
NP : Noun Phrase
O( ) : big-Oh notation (= at most)
OCR : Optical Character Recognition
P : Production Rule
POS : Part Of Speech
PP : Prepositional Phrase
Q : Query
R : Ranking Value
R_Dist : Relative Distance
S : Start Symbol
SMT : Statistical Machine Translation
SR : Speech Recognition
St1, St2 : String1, String2
V : Variable
v : Adverbial Phrase
VP : Verb Phrase
Ω( ) : big-Omega notation (= at least)
  14. 14.   Chapter One Overview
  15. 15. Chapter One Overview 1.1 Introduction Natural Language Processing, also known as computational linguistics, is the field of computer science that deals with linguistics; it is a form of human-computer interaction where formalization is applied to the elements of human language so that they can be processed by a computer [Ach14]. Natural Language Processing (NLP) is the implementation of systems that are capable of manipulating and processing natural language (NL) sentences [Jac02] such as English, Arabic and Chinese, not formal languages such as Python, Java and C++, nor descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12]. The task of NLP is the design and building of software for analyzing, understanding and generating spoken and/or written NLs. [Man08] [Mis13] NLP has many applications such as automatic summarization, Machine Translation (MT), Part-Of-Speech (POS) Tagging, Speech Recognition (SR), Optical Character Recognition (OCR), Information Retrieval (IR), Opinion Mining [Nad11], and others [Wol11]. Text correction is another significant application of NLP. It includes both spell checking and Grammar Error Correction (GEC). Spell checking research extends back to the mid-20th century with Lee Earnest at Stanford University, but the first application was created in 1971 by Ralph Gorin, Lee's student, for the DEC PDP-10 mainframe with a dictionary of 10,000 English words. [Set14] [Pet80] Grammar error correction, in spite of its central role in semantic and meaning representation, is largely ignored by the NLP community.
  16. 16. In recent years, an improvement has been noticed in automatic GEC techniques. [Voo05] [Jul13] However, most of these techniques are limited to specific domains such as real-word spell correction [Hwe14], subject-verb disagreement [Han06], verb tense misuse [Gam10], determiner or article errors, and improper preposition usage. [Tet10] [Dah11] Different techniques like edit distance [Wan74], rule-based techniques [Yan83], similarity key techniques [Pol83] [Pol84], n-grams [Zha98], probabilistic techniques [Chu91], neural nets [Hod03] and the noisy channel model [Tou02] have been proposed for text correction purposes. Each technique needs some sort of resource: edit distance, rule-based and similarity key techniques require a dictionary (or lexicon); n-gram and probabilistic techniques work with statistical and frequency information; neural nets are trained with training patterns, etc. Text correction, spelling and grammar, is an extensive process that typically includes three major steps: [Ach14] [Jul13] The first step is to detect the incorrect words. The most popular way to decide whether a word is misspelled is to look it up in a dictionary, a list of correctly spelled words. This approach can detect non-word errors but not real-word errors [Kuk92] [Mis13], because an unintended word may still match a word in the dictionary. NLs have a large number of words, resulting in a huge dictionary; therefore, the task of looking up every word consumes a long time. In GEC this step is more complicated: it requires applying more analysis at the level of sentences and phrases, using computational linguistics basics, to detect the word that makes the sentence incorrect. Next, a list of candidates or alternatives should be generated for the incorrect word (misspelled or misused). This list is preferred to be short and to contain the words with the highest similarity or suitability; to produce it, a technique is needed to calculate the similarity of the incorrect word to every word in the dictionary. Efficiency and accuracy are major factors in the selection of such a technique. GEC requires broad knowledge of diverse grammatical error categories and extensive linguistic techniques to identify alternatives, because a grammatical error may not result from a single word. Finally, the intended word, or a list of alternatives containing the intended word, is suggested. This task requires ranking the words according to their degree of similarity to the incorrect word; other considerations may or may not be taken into account depending on the technique in use.
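The minimum edit distance family mentioned above can be illustrated with a minimal sketch of the standard Levenshtein distance. This is the textbook dynamic-programming formulation (insertions, deletions and substitutions only), shown here only for illustration; it is not the enhanced variant developed later in this thesis.

```python
def levenshtein(source: str, target: str) -> int:
    """Standard Levenshtein distance via dynamic programming.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn `source` into `target`.
    """
    m, n = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        dist[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]


if __name__ == "__main__":
    print(levenshtein("acress", "across"))  # 1 (one substitution)
    print(levenshtein("teh", "the"))        # 2 (no transposition handling)
```

The second example shows why the plain measure is unsatisfying for typing errors: a simple transposition costs two operations, which is one motivation for the Damerau-style extensions discussed in this work.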
  17. 17. Text mining techniques have started to enter the area of text correction; clustering [Zam14], Named-Entity Recognition (NER) [Bal00] [Rit11] and Information Retrieval [Kir11] are examples. Statistics and probability have also played a great role, specifically in analyzing common mistakes and n-gram datasets [Ahm09] [Gol96] [Amb08]. Clustering, at both the syllable and phonetic levels, can be used to reduce the looking-up space; NER may help to avoid interpreting proper nouns as misspellings; statistics are merged with NLP techniques to provide more precise parsing and POS tagging, usually in context-dependent applications. The application of a given technique differs according to what level of correction is intended; it starts from the character level [Far14], passes through the word, phrase (usually in GEC) and sentence levels, and ends at the context or document subject level. 1.2 Problem Statement Although many text checking and correction systems have been produced, each has its own variances in input quality restrictions, techniques used, output accuracy, speed, performance conditions, etc. [Ahm09] [Pet80]. This field of NLP is still an open research area from all sides because there is no complete algorithm or technique that handles all considerations.
  18. 18. The limited linguistic knowledge, the huge number of lexicon entries, the extensive grammar, language ambiguity and change over time, the variety of committed errors and the computational requirements are challenges facing the process of developing a text correction application. In this work, some of the above mentioned problems are addressed using a set of solutions: • Integrating two lexicon datasets (WordNet and Ispell). • Using a brute-force approach to solve some sorts of ambiguity. • Applying hashing and indexing in looking up the dictionary. • Reducing the search space in the candidates collecting process by grouping similarly spelled words into semi-clusters. The Levenshtein method [Hal11] is also enhanced to consider Damerau's four types of errors within a time period shorter than that of the Damerau-Levenshtein method [Hal11]. Named Entity Recognition, letter confusion and transposition, and candidate length effect are used as features to optimize the candidates' suggestion, in addition to applying rules of Part-Of-Speech tags and sentence constituency for checking sentence grammar correctness, whether or not the sentence is lexically correct. The three proposed components of this system are: (1) a spell error detector based on a fast looking-up technique in a dictionary of more than 300,000 tokens, constructed by applying a string prefix dependent hash function and indexing method; the grammar error detector is a brute-force parser. (2) For candidates generation, an enhancement was implemented on the Levenshtein method to consider Damerau's four error types; it is then used to measure similarity according to the minimum edit distance and the effect of the difference in lengths, and the dictionary tokens are grouped into spell-based clusters to reduce the search space. (3) The candidates suggestion exploits NER features, transposition error and confusion statistics, affix analysis (including first and last letter matching), the length of candidates, and parsing success.
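A toy sketch of prefix-dependent dictionary bucketing, in the spirit of the first component, is given below. The two-letter bucket key, the data structures and the function names are illustrative assumptions only; the thesis's actual hash function, alphabet encoding and indexing scheme are presented in Chapter Three.

```python
from bisect import bisect_left
from collections import defaultdict

def build_prefix_index(words):
    """Group dictionary words into buckets keyed by a short prefix.

    A look-up then touches only one small, sorted bucket instead of
    scanning the whole word list.
    """
    buckets = defaultdict(list)
    for w in words:
        buckets[w[:2].lower()].append(w.lower())
    for key in buckets:
        buckets[key].sort()          # keep each bucket sorted for binary search
    return buckets

def contains(buckets, token):
    """Check a token against its prefix bucket using binary search."""
    bucket = buckets.get(token[:2].lower(), [])
    pos = bisect_left(bucket, token.lower())
    return pos < len(bucket) and bucket[pos] == token.lower()

if __name__ == "__main__":
    dictionary = ["the", "they", "them", "theory", "dictionary", "dictate"]
    index = build_prefix_index(dictionary)
    print(contains(index, "theory"))   # True
    print(contains(index, "theoryy"))  # False -> flagged as a non-word
```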
  19. 19. 1.3 Literature Review • Asha A. and Bhuma V. R., 2014, introduced a probabilistic approach to string transformation that includes a model consisting of rules and weights for training, and an algorithm that depends on scoring and ranking according to a conditional probability distribution for generating the top k candidates at the character level, where both high and low frequency words can be generated. Spell checking is one of many applications to which the approach was applied; the misspelled strings (words or characters) are transformed by applying a number of operators into the k most similar strings in a dictionary (start and end letters are constant). [Ach14] • Mariano F., Zheng Y., and others, 2014, tackled the correction of grammatical errors by pipelining processes that combine results from multiple systems. The components of the approach are: a rule-based error corrector that uses rules automatically derived from the Cambridge Learner Corpus, based on N-grams that have been annotated as incorrect; an SMT system that translates incorrectly written English into correct English; NLTK1, which was used to perform segmentation, tokenization, and POS tagging; candidate generation, which produces all the possible combinations of corrections for the sentence, in addition to the sentence itself to consider the "no correction" option; finally, the candidates are ranked using a language model. [Fel14] __________________________________________________________ 1 The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language.
  20. 20. • Anubhav G., 2014, presented a rule-based approach that used two POS taggers, the Stanford parser and Tree Tagger, to correct non-native English speakers' grammatical errors. The detection of errors depends on the outputs of the two taggers: if they differ, then the sentence is not correct. Errors are corrected using the Nodebox English Linguistic library. Error correction includes subject-verb disagreement, verb form, and errors detected by POS tag mismatch. [Gup14] • Stephan R., 2013, proposed a model for spelling correction based on treating words as "documents" and spell correction as a form of document retrieval, in that the model retrieves the best matching correct spelling for a given input. The words are transformed into tiny documents of bits, and Hamming distance is used to predict the closest string of bits from a dictionary holding the correctly spelled words as strings of bits. The model is knowledge-free and only contains a list of correct words. [Raa13] • Youssef B., 2012, produced a parallel spell checking algorithm for spelling error detection and correction. The algorithm is based on information from the Yahoo! N-gram dataset 2.0; it is a shared memory model allowing concurrency among threads for both parallel multiprocessor and multi-core machines. The three major components (error detector, candidates generator and error corrector) are designed to run in a parallel fashion. The error detector, based on unigrams, detects non-word errors; the candidates generator is based on bi-grams; the error corrector, which is context sensitive, is based on 5-gram information. [Bas12] • Hongsuck S., Jonghoon L., Seokhwan K., Kyusong L., Sechun K., and Gary G. L., 2012, presented a novel method for grammatical error correction by building a meta-classifier. The meta-classifier decides the final output depending on the internal results from several base classifiers; they used multiple grammatical-error-tagged corpora with
  21. 21. different properties in various aspects. The method focused on articles, and a correction arises only when a mismatch occurs with the observed articles. [Seo12] • Kirthi J., Neeju N.J., and P. Nithiya, 2011, proposed a semantic information retrieval system performing automatic spell correction for user queries before applying the retrieval process. The correcting procedure depends on matching the misspelled word against a dictionary of correctly spelled words using the Levenshtein algorithm. If an incorrect word is encountered, the system retrieves the most similar word depending on the Levenshtein measure and the occurrence frequency of the misspelled word. [Kir11] • Farag, Ernesto, and Andreas, 2008, developed a language-independent spell checker. It is based on enhancing the N-gram model by creating a ranked list of correction candidates derived from N-gram statistics and lexical resources, then selecting the most promising candidates as correction suggestions. Their algorithm assigns weights to the possible suggestions to detect non-word errors. They relied on a "MultiWordNet" dictionary of about 80,000 entries. [Ahm09] • Mays, Damerau, and Mercer, 2008, designed a noisy-channel model of real-word spelling error correction. They assumed that the observed sentence is a signal passed through a noisy channel, where the channel reflects the typist and the distortion reflects errors committed by the typist. The probability of the sentence's correctness, given by the channel (typist), is a parameter associated with that sentence. The probability of every word in the sentence being the intended one is equivalent to the sentence correctness probability, and the word is associated with a set of spelling-variant words excluding the word itself. Correction can be applied to one word in the sentence by replacing the incorrect one with another
  22. 22. from the candidates set (its real-word spelling variations) so that it gives the maximum probability. [Amb08] • Stoyan, Svetla, and others, 2005, described an approach for lexical post-correction of the output of an optical character recognizer (OCR) as a two-part research project. They worked on multiple sides: on the dictionary side, they enriched their large dictionaries with specialty dictionaries; for candidate selection, they used a very fast searching algorithm that depends on Levenshtein automata for efficiently selecting the correction candidates with a bound not exceeding 3; and they ranked candidates depending on a number of features such as frequency and edit distance. [Mih04] • Suzan V., 2002, described a context sensitive spell checking algorithm based on the BESL spell checker lexicons and word trigrams for detecting and correcting real-word errors using probability information. The algorithm splits the input text into trigrams, and every trigram is looked up in a precompiled database which contains a list of trigrams and their occurrence counts in the corpus used for database compiling. The trigram is correct if it is in the trigram database; otherwise it is considered an erroneous trigram containing a real-word error. The correction algorithm uses the BESL spell checker to find candidates, but those most frequent in the trigram database are suggested to the user. [Ver02]
  23. 23. Table 1.1: Summary of Literature Review
1. [Ach14]: Methodology: generating the top k candidates at the character level for both high and low frequency words. Technique: a model consisting of rules and weights, and a conditional probability distribution dependent algorithm.
2. [Fel14]: Methodology: grammatical error correction based on generating all possible correct alternatives for the sentence. Technique: combining the results of multiple systems: a rule based error corrector, an SMT incorrect-English to correct-English translator, and NLTK for segmentation, tokenization and tagging.
3. [Gup14]: Methodology: correction of non-native English speakers' grammatical errors. Technique: error detection using the Stanford parser and Tree Tagger; correction based on the Nodebox English Linguistic library.
4. [Raa13]: Methodology: dictionary based spell correction that treats the misspelled word as a document. Technique: converting the misspelled word into a tiny document of bits and retrieving the most similar documents using Hamming distance.
5. [Bas12]: Methodology: context sensitive spell checking using a shared memory model allowing concurrency among threads for parallel execution. Technique: different N-gram levels for error detection, candidates generation, and candidates suggestion depending on the Yahoo! N-Grams dataset 2.0.
6. [Seo12]: Methodology: a meta-classifier for grammatical error correction focused mainly on articles. Technique: deciding the output depending on the internal results from several base classifiers.
7. [Kir11]: Methodology: automatic spell correction for user queries before applying the retrieval process. Technique: using the Levenshtein algorithm for both error detection and correction in a dictionary looking-up technique.
  24. 24. 8. [Ahm09]: Methodology: a language independent model for non-word error correction based on N-gram statistics and lexical resources. Technique: ranking a list of correction candidates by assigning weights to the possible suggestions depending on a "MultiWordNet" dictionary of about 80,000 entries.
9. [Amb08]: Methodology: a noisy channel model for real-word error correction based on probability. Technique: the channel represents the typist, the distortion represents the error, and the noise probability is a parameter.
10. [Mih04]: Methodology: OCR output post-correction. Technique: Levenshtein automata for candidates generation and frequency for ranking.
11. [Ver02]: Methodology: a context sensitive spell checking algorithm based on tri-grams. Technique: splitting texts into word trigrams and matching them against the precompiled BESL spell checker lexicons; suggestion depends on probability information.
1.4 Research Objectives This research attempts to design and implement a smart text document correction system for English texts. It is based on mining a typed text to detect spelling and grammar errors and to give the optimal suggestion(s) from a set of candidates. Its steps are as follows (a schematic sketch is given after this list): 1. Analyzing the given text using Natural Language Processing techniques and, at each step, detecting the erroneous words. 2. Looking up candidates for the erroneous words and ranking them according to a given set of features and conditions to form the initial solutions. 3. Optimizing the initial solutions depending on the information extracted from the given text and the detected errors. 4. Recovering the input text document with the optimal solutions and associating the best set of candidates with each detected incorrect word.
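The schematic sketch referred to above is shown below. Every function name and body here is a hypothetical placeholder meant only to show how the four steps feed into one another; it is not the implementation described in Chapters Three to Five.

```python
from typing import Dict, List

# Hypothetical stage stubs; the real components are described in Chapters 3-5.
def detect_errors(tokens: List[str], dictionary: set) -> List[int]:
    """Step 1: indices of tokens not found in the dictionary (non-word errors)."""
    return [i for i, t in enumerate(tokens) if t.lower() not in dictionary]

def generate_candidates(token: str, dictionary: set) -> List[str]:
    """Step 2: rank dictionary words by a crude similarity proxy (placeholder)."""
    return sorted(dictionary, key=lambda w: abs(len(w) - len(token)))[:5]

def optimize(candidates: List[str], context: List[str]) -> str:
    """Step 3: pick the candidate that best fits the sentence context (placeholder)."""
    return candidates[0] if candidates else ""

def correct_document(text: str, dictionary: set) -> Dict[str, str]:
    """Step 4: map each detected error to its best suggestion for recovery."""
    tokens = text.split()
    corrections = {}
    for i in detect_errors(tokens, dictionary):
        cands = generate_candidates(tokens[i], dictionary)
        corrections[tokens[i]] = optimize(cands, tokens)
    return corrections

if __name__ == "__main__":
    vocab = {"the", "teacher", "talked", "to", "students"}
    print(correct_document("the teachre talked to the students", vocab))
    # {'teachre': 'teacher'}
```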
  25. 25. 1.5 Thesis Outlines The next five chapters are: 1. Chapter Two: "Background and Related Concepts" consists of two parts. The first overviews NLP fundamentals, applications and techniques, whereas the second is about text correction techniques. 2. Chapter Three: "Dictionary Structure and Looking up Technique" describes the suggested approach for constructing the system's dictionary for both perfect matching and similarity looking up. 3. Chapter Four: "Error Detection and Candidates Generation" declares the suggested technique for indicating incorrect words and the method of generating candidates. 4. Chapter Five: "Automatic Text Correction and Candidates Suggestion" describes the techniques of suggestion selection and optimization. 5. Chapter Six: "Experimental Results, Conclusion, and Future Works" presents the experimental results of applying the techniques described in chapters three, four and five, the conclusions about the system, and future directions.
  26. 26.   Chapter Two Background and Related Concepts
  27. 27. Chapter Two Background and Related Concepts Part I Natural Language Processing 2.1 Introduction Natural Language Processing (NLP) began in the late 1940s, focused on machine translation; in 1958, NLP was linked to information retrieval by the Washington International Conference of Scientific Information; [Jon01] the primary ideas for developing applications for detecting and correcting text errors started in that period. [Pet80] [Boo58] Natural Language Processing has attracted great interest from that time until today because it plays an important role in the interaction between humans and computers. It represents the intersection of linguistics and artificial intelligence [Nad11], where a machine can be programmed to manipulate natural language. 2.2 Natural Language Processing Definition "Natural Language Processing (NLP) is the computerized approach for analyzing text that is based on both a set of theories and a set of technologies." [Sag13] NLP describes the function of software or hardware components in a computer system that are capable of analyzing or synthesizing human languages (spoken or written) [Jac02] [Mis13] like English, Arabic and Chinese, not formal languages like Python, Java and C++, nor
  28. 28. descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12]. "NLP is a tool that can reside inside almost any text processing software application" [Wol11] We can define NLP as a subfield of Artificial Intelligence that encompasses anything needed by a computer to understand and generate natural language. It is based on processing human language for two tasks: the first receives a natural language input (text or speech), applies analysis, reasons about what was meant by that input, and outputs in computer language; this is the task of Natural Language Understanding (NLU). The second task is to generate human sentences according to specific considerations; the input is in computer language but the output is in human languages; this is called Natural Language Generation (NLG). [Raj09] "Natural Language Understanding is associated with the more ambitious goals of having a computer system actually comprehend natural language as a human being might". [Jac02] 2.3 Natural Language Processing Applications Despite its wide usage in computer systems, NLP disappears almost entirely into the background, where it is invisible to the user and adds significant business value. [Wol11] The major distinction of NLP applications from other data processing systems is that they use language knowledge. Natural Language Processing applications are mainly divided into two categories according to the given NL format [Mom12] [Wol11]:
  29. 29. 2.3.1 Text Technologies • Spell and Grammar Checking: systems that deal with indicating lexical and grammar errors and suggesting corrections. • Text Categorization and Information Filtering: in such applications, NLP represents the documents linguistically and compares each one to the others. In text categorization, the documents are grouped according to the characteristics of their linguistic representation into several categories. Information filtering singles out, from a collection of documents, the documents that satisfy some criterion. • Information Retrieval: finds and collects information relevant to a given query. A user expresses the information need by a query, then the system attempts to match the given query to the database documents that satisfy it. The query and documents are transformed into a sort of linguistic structure, and the matching is performed accordingly. • Summarization: according to an information need or a query from the user, this type of application finds the most relevant part of the document. • Information Extraction: refers to the automatic extraction of structured information from unstructured sources; structured information such as entities, their relationships, and the attributes describing them. This can integrate structured and unstructured data sources, if both exist, and pose queries spanning the integrated information, giving better results than applying keyword searches alone. • Question Answering: works with plain speech or text input and applies an information search based on the input; for example, IBM® Watson™, the reigning JEOPARDY! champion, which reads
  30. 30. questions and understands their intention, then looks up the knowledge library to find a match. • Machine Translation: translates a given text from a specific natural language to another natural language; some applications have the ability to recognize the language of the given text even if the user did not specify it correctly. • Data Fusion: combining extracted information from several text files into a database or an ontology. • Optical Character Recognition: digitizing handwritten and printed texts, i.e. converting characters from images to digital codes. • Classification: this NLP application type sorts and organizes information into relevant categories, like e-mail spam filters and the Google News™ news service. • NLP has also entered other applications such as educational essay test-scoring systems, voice-mail phone trees, and even e-mail spam detection software. 2.3.2 Speech Technologies • Speech Recognition: mostly used in telephone voice response systems for serving clients. Its task is processing plain speech. It is also used to convert speech into text. • Speech Synthesis: converting text into speech. This process requires working at the level of phones and converting alphabetic symbols into sound signals.
  31. 31. 2.4 Natural Language Processing and Linguistics Natural Language Processing is concerned with three dimensions: language, algorithm and problem, as presented in figure (2.1). The language dimension considers linguistics; the algorithm dimension covers NLP techniques and tasks, while the problem dimension depicts the mechanisms applied to solve problems. [Bha12] 2.4.1 Linguistics Natural language is a means of communication. It is a system of arbitrary signals such as voice sounds and written symbols. [Ali11] Linguistics is the scientific study of language; it starts from the simple acoustic signals which form sounds and ends with pragmatic understanding to produce the full context meaning. There are two major levels of linguistic analysis, Speech Recognition (SR) and Natural Language Processing (NLP), as shown in figure (2.2). Figure (2.1): NLP dimensions [Bha12]
  32. 32. [Figure (2.2): Linguistics analysis steps [Cha10]; the figure shows the pipeline from acoustic signal, phones, letters and strings, morphemes, words, and phrases and sentences, to meaning out of context and meaning in context, spanning the SR levels (phonetics, phonology) and the NLP levels (lexicon, morphology, syntax, semantics, pragmatics).] 2.4.1.1 Terms of Linguistic Analysis A natural language, as a formal language does, has a set of basic components that may vary from one language to another but remain bounded under specific considerations, giving every language its special characteristics. From the computational view, a language is a set of strings generated over a finite alphabet and can be characterized by a grammar. The definition
  33. 33. of the three abstracted names is dependent on the language itself; i.e. strings, alphabet and grammar formulate and characterize a language. • Strings: in natural language processing, the strings are the morphemes of the language, their combinations (words) and the combinations of their combinations (sentences), but linguistics goes somewhat deeper than this. It starts with phones, the primitive acoustic patterns, which are significant and distinguishable from one natural language to another. Phonology groups phones together to produce phonemes, represented by symbols. Morphemes consist of one or more symbols; thus, NLs can be further distinguished. • Alphabet: when individual symbols, usually thousands, represent words, the language is "logographic"; if the individual symbols represent syllables, it is "syllabic"; and when they represent sounds, the language is "alphabetic". Syllabic and alphabetic languages typically have less than 100 symbols, unlike logographic ones. English is an alphabetic language system consisting of 26 symbols; these symbols represent phones, which are combined into morphemes, which may or may not be combined further to form words. • Grammar: a grammar is a set of rules specifying the legal structure of the language; it is a declarative representation of the language's syntactic facts. Usually, a grammar is represented by a set of production rules.
  34. 34. 2.4.1.2 Linguistic Units Hierarchy Language can be divided into pieces; there is a typical structure or form for every level of analysis. Those pieces can be put into a hierarchical structure starting from a meaningful sentence at the top level and proceeding in the separation of building units until reaching the primary acoustic sounds. Figure (2.3) presents an example. [Figure (2.3): Linguistic Units Hierarchy; the figure decomposes the sentence "The teacher talked to the students" into its phrases, words, morphemes ("The teach er talk ed to the student s") and, finally, its phonemes1.] 2.4.1.3 Sentence Structure and Constituency "It is constantly necessary to refer to units smaller than the sentence itself, units such as those which are commonly referred to as CLAUSE, PHRASE, WORD, and MORPHEME. The relation between one unit and another unit of which it is a part is CONSTITUENCY." [Qui85] ________________________________________________________ 1 The symbols denote the latest codes of English phones adopted by the OXFORD dictionaries.
  35. 35. The task of dividing a sentence into constituents is a complex task that requires incorporating more than one analysis stage; tokenization, segmentation, parsing (and sometimes stemming) are usually merged together to build the parse tree for a given sentence. 2.4.1.4 Language and Grammar A language is a 'set' of sentences and a sentence is a 'sequence' of 'symbols' [Gru08]; it can be generated given its context free grammar G = (V, ∑, S, P). [Cla10] Commonly, grammars are represented as a set of production rules which is taken by the parser and compared against the input sentences. Every matched rule adds something to the complete structure of the sentence, which is called the 'parse tree'. [Ric91] Context free grammar (CFG) is a popular method for specifying formal grammars. It is used extensively to define language syntax. The four components of the grammar are defined in a CFG as [Sag13]: • Terminals (∑): represent the basic elements which form the strings of the language. • Nonterminals or Syntactic Variables (V): sets of strings that define the language generated by the grammar. Nonterminals are a key to syntax analysis and translation by imposing a hierarchical structure on the language. • Set of production rules (P): this set defines the way of combining terminals with nonterminals to produce strings. A production rule consists of a variable on the left side, which represents its head, and a body on the right side into which the head can be rewritten. • Start symbol (S): the nonterminal from which every derivation starts. The following is an example describing the structure of a simple English sentence:
  36. 36. V = {S, NP, N, VP, V, Art} ∑ = {boy, icecream, dog, bite, like, ate, the, a} P = {S → NP VP, NP → N, NP → Art N, VP → V NP, N → boy | icecream | dog, V → ate | like | bite, Art → the | a} The grammar specifies two things about the language: [Ric91] • Its weak generative capacity: the limited set of sentences which can be completely matched by a series of grammar rules. • Its strong generative capacity: the grammatical structure(s) of each sentence in the language. Generally, there is an infinite number of sentences that can be structured with each grammar. The strength and importance of grammars lie in their ability to supply structure to an infinite number of sentences, because they succinctly summarize the structures of an infinite number of objects of a certain class. [Gru08] A grammar is said to be generative if it has a fixed-size set of production rules which, if followed, can generate every sentence in the language using an unbounded number of actions. [Gru08]
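As a concrete illustration of the toy grammar above, the following sketch builds the same CFG with the NLTK toolkit (mentioned in the literature review) and parses one sentence; the chosen sentence and the use of NLTK here are illustrative assumptions, not part of the thesis.

```python
import nltk

# The toy grammar from the example above, written in NLTK's CFG notation.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> N | Art N
VP -> V NP
N -> 'boy' | 'icecream' | 'dog'
V -> 'ate' | 'like' | 'bite'
Art -> 'the' | 'a'
""")

parser = nltk.ChartParser(toy_grammar)
sentence = "the dog ate the icecream".split()

# Print every parse tree the grammar admits for the sentence.
for tree in parser.parse(sentence):
    print(tree)
    # (S (NP (Art the) (N dog)) (VP (V ate) (NP (Art the) (N icecream))))
```

A sentence outside the grammar's weak generative capacity, such as "dog the ate", simply yields no parse tree, which is exactly how a parser-based checker flags an ungrammatical input.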
  37. 37. 2.5 Natural Language Processing Techniques 2.5.1 Morphological Analysis Morphology is the study of how words are constructed from morphemes, which represent the minimal meaning-bearing primitive units of a language. [Raj09] [Jur00] There are two broad classes of morphemes: stems and affixes; the distinction between the two classes is language dependent in that it varies from one language to another. The stem usually refers to the main part of the word, and affixes can be added to the word to give it additional meaning. [Jur00] Furthermore, affixes can be divided into four categories according to the position where they are added. Prefixes, suffixes, circumfixes and infixes generally refer to the different types of affixes, but it is not necessary for a language to have all the types. English accepts both prefixes, which precede stems, and suffixes, which follow stems, while there is no good example of a circumfix (preceding and following a stem) in English, and infixing (inserting inside the stem) is not allowed (unlike German and Philippine languages, respectively). [Jur00] Morphology is concerned with recognizing the modification of base words to form other words with different syntactic categories but similar meanings. Generally, three forms of word modification are found [Jur00]: • Inflection: syntactic rules change the textual representation of the words, such as adding the suffix 's' to convert nouns into plurals, or adding 'er' and 'est' to convert regular adjectives into comparative and
  38. 38. superlative forms, respectively. This type of modification usually yields a word from the same word class as the stem word. • Derivation: new words are produced by adding morphemes; it is usually more complex and harder in meaning than inflectional morphology. It often occurs in a regular manner and yields words that differ in their word class from the stem word, like adding the suffix 'ness' to 'happy' to produce 'happiness'. • Compounding: this type modifies stem words with other stem words by grouping them, like grouping 'head' with 'ache' to produce 'headache'. In English, this type is infrequent. Morphological processing, also known as stemming, depends heavily on the analyzed language. The output is the set of morphemes that are combined to form words. Morphemes can be stem words, affixes, and punctuation marks.
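A minimal sketch of suffix-based morphological analysis for the inflection and derivation examples above is shown below; the tiny suffix table and helper names are illustrative assumptions, not the morphological analyzer used in this work.

```python
# Hypothetical suffix rules: (suffix, stem repair, kind of modification).
SUFFIX_RULES = [
    ("iness", "y", "derivation"),   # happiness -> happy + ness (y became i)
    ("ness",  "",  "derivation"),   # darkness  -> dark  + ness
    ("est",   "",  "inflection"),   # tallest   -> tall  + est
    ("er",    "",  "inflection"),   # taller    -> tall  + er
    ("s",     "",  "inflection"),   # books     -> book  + s
]

def analyze(word):
    """Return (stem, matched suffix, kind), or the word itself when no rule applies.

    A real analyzer would also validate the stem against a lexicon.
    """
    for suffix, repair, kind in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + repair, suffix, kind
    return word, "", "none"

if __name__ == "__main__":
    for w in ["happiness", "tallest", "books", "headache"]:
        print(w, "->", analyze(w))
    # 'headache' matches no rule: compounds are not handled by suffix stripping.
```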
  39. 39. 2.5.2 Part Of Speech Tagging Part of Speech (POS) tagging is the process of assigning the proper lexical information or POS tag (also known as word classes, lexical tags, and morphological classes), encoded as a symbol, to every word (or token) in a sentence. [Sco99] [Has06b] In English, POS tags are classified into four basic classes of words: [Qui85] 1. Closed classes: include prepositions, pronouns, determiners, conjunctions, modal verbs and primary verbs. 2. Open classes: include nouns, adjectives, adverbs, and full verbs. 3. Numerals: include numbers and ordinals. 4. Interjections: include a small set of words like oh, ah, ugh, phew. Usually, a POS tag indicates one or more of the previous pieces of information, and it sometimes holds other features like the tense of the verb or the number (plural or singular). POS tagging may generate tagged corpora or serve as a preprocessing step for the next NLP processes. [Sco99] The performance of most tagging systems is typically limited because they only use the local lexical information available in the sentence, in contrast to syntax analyzing systems, which exploit both lexical and structural information. [Sco99] More research has been done and several models and methods have been proposed to enhance taggers' performance; they fall mainly into supervised and unsupervised methods, where the main difference between the two categories is the set of training corpora, which is pre-tagged in supervised methods, unlike unsupervised methods, which need advanced computational methods for obtaining such corpora. [Has06a] [Has06b] Figure (2.4) presents the main categories and shows some examples. In both categories, the following are the most popular: Figure (2.4): Classification of POS tagging models [Has06a]
  40. 40. • Statistical (stochastic, or probabilistic) methods: taggers which use these methods are first trained on a correctly tagged set of sentences, which allows the tagger to disambiguate words by extracting implicit rules or picking the most probable tag based on the words surrounding the given word in the sentence. Examples of these methods are Maximum-Entropy models, Hidden Markov Models (HMM), and Memory Based models. • Rule based methods: a sequence of rules, a set of hand-written rules, is applied to detect the best tag set for the sentence regardless of any probability maximization. The set of rules needs to be written properly and checked by human experts. Examples: path-voting constraint models and decision tree models. • Transformational approach: combines both statistical methods and rule based methods to first find the most probable set of available tags and then apply a set of rules to select the best. • Neural Networks: with a linear separator or a full neural network, have been used for tagging processes. The methods described above, as in any other research area, have their advantages and disadvantages; but there is a major difficulty facing all of them: the tagging of unknown words (words that have never been seen before in the training corpora). While rule-based approaches depend on a special set of rules to handle such situations, stochastic and neural net methods lack this feature and use other ways, such as suffix analysis and n-grams, by applying morphological analysis; some methods use a default set of tags to disambiguate unknown words. [Has06a]
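As a small illustration of statistical tagging, the sketch below runs NLTK's pretrained English perceptron tagger on the example sentence from figure (2.3); NLTK is used here only as an assumed off-the-shelf example, not as the tagger adopted in this thesis.

```python
import nltk

# One-time downloads of the tokenizer model and the pretrained English tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The teacher talked to the students"
tokens = nltk.word_tokenize(sentence)

# Each token is paired with a Penn Treebank tag (DT, NN, VBD, TO, NNS, ...).
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('teacher', 'NN'), ('talked', 'VBD'),
#  ('to', 'TO'), ('the', 'DT'), ('students', 'NNS')]
```

Note how the tag also carries morphological information mentioned above, e.g. VBD marks a past-tense verb and NNS a plural noun.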
  41. 41. 2.5.3 Syntactic Analysis "Syntax is the study of the relationships between linguistic forms, how they are arranged in sequence, and which sequences are well-formed". [Yul00] Syntactic analysis, also referred to as "parsing", is the process of converting a sentence from its flat format, represented as a sequence of words, into a structure that defines its units and the relations between these units. [Raj09] Hence, the goal of this technique is to transform natural language into an internal system representation. The format of this representation may be dependency graphs, frames, trees or some other structural representation. Syntactic parsing attempts only to convert sentences into either dependency links representing the utterance's syntactic structure or a tree structure, and the output of this process is called a "parse tree" or simply a "parse". [Dzi04] The parse tree of the sentence holds its meaning at the level of the smallest parts ("words" in terms of language scientists, "tokens" in terms of computer scientists). [Gru08] Syntactic analysis makes use of both the results of morphological analysis and Part-Of-Speech tagging to build the structural description of the sentence by applying the grammar rules of the language under consideration; if a sentence violates the rules, it is rejected and marked as incorrect. [Raj09] The two main components of every syntax analyzer are: • Grammar: the grammar provides the analyzer with the set of production rules that will lead it to construct the structure of the sentences, and it specifies the correctness of every given sentence.
  42. 42. Good grammars make a careful distinction between the sentence/word level, which they often call syntax or syntaxis, and the word/letter level, which they call morphology. [Gru08] • Parser: the parser reconstructs the production tree (or trees) by applying the grammar to indicate how the given sentence (if correctly constructed) was produced from that grammar. Parsing is the process of structuring a linear representation in accordance with a given grammar. Today, most parsers combine context free grammars with probability models to determine the most likely syntactic structure out of the many others that are accepted as parse trees for an utterance. [Dzi04] 2.5.4 Semantic Analysis "Semantics is the study of the relationships between linguistic forms and entities in the world; that is, how words literally connect to things." [Yul00] This technique and those following it are fundamentally what language understanding depends on. Semantic analysis is the process of assigning meanings to the syntactic structures of the sentences regardless of their context. [Yul00] [Raj09] 2.5.5 Discourse Integration Discourse analysis is concerned with studying the effect of sentences on each other. It shows how a given sentence is affected by the one preceding it and how it affects the sentence following it. Discourse integration is relevant to understanding texts and paragraphs rather than simple sentences; thus, discourse knowledge is important in the interpretation
  43. 43. of temporal aspects (like pronouns) in the conveyed information. [Ric91] [Raj09] 2.5.6 Pragmatic Analysis This step interprets the structure that represents what is said in order to determine what was actually meant. Context is a fundamental resource for processing here. [Ric91] 2.6 Natural Language Processing Challenges The challenges of natural language processing are too numerous to be summarized in a limited list; with every processing step, from the start point to the outputting of results, there is a set of problems that natural language processors vary in their ability to handle. However, the application where NLP is used is usually concerned with a specific task rather than considering all processing steps with all their details; this is an advantage for the NLP community that helps to outline the challenges and problems according to the task under consideration. For our research area, we are precisely concerned with the set of problems that directly affect the task of text correction; the next subsections describe some of them: 2.6.1 Linguistic Units Challenges The task of text correction spans the levels from characters up to paragraphs and full texts, and at every level there is a set of difficulties that the handling analyzer faces: 2.6.1.1 Tokenization In this process, the lexical analyzer, usually called a "tokenizer", divides the text into smaller units, and the output of this step is a series of
morphemes, words, expressions and punctuation marks (called tokens). It involves locating token boundaries (where one token ends and another begins). Issues that arise in tokenization and should be addressed are [Nad11]:
• Dependence on language type: a language includes, in addition to its symbols, a set of orthographic conventions which are used in writing to indicate the boundaries of linguistic units. English employs whitespace to separate words, but this is not sufficient to tokenize a text completely and unambiguously, because the same character may serve different uses (as is the case with punctuation), some words have multiple parts (such as a word divided by a hyphen at the end of a line, and some cases of prefix attachment), and many expressions consist of more than one word.
• Encoding problems: syllabic and alphabetic writing systems are usually encoded using a single byte, but languages with larger character sets require two or more bytes. A problem arises when the same set of encodings represents different character sets, whereas tokenizers are targeted to a specific encoding of a specific language.
• Other problems, such as the dependence on application requirements, which dictate what constituent is defined as a token; in computational linguistics the definition should precisely indicate what the next processing step requires. The tokenizer should also be able to recognize irregularities in texts such as misspellings, erratic spacing, punctuation, etc.

2.6.1.2 Segmentation

Segmenting text means dividing it into small meaningful pieces, typically referred to as "sentences"; a sentence consists of one or more tokens
and carries a meaning which may not be completely clear. This task requires full knowledge of the scope of punctuation marks, since they are the major factor in denoting the starts and ends of sentences. Segmentation becomes more complicated as punctuation takes on more uses: some punctuation marks can be part of a token rather than a stopping mark, as is the case with periods (.) used in abbreviations. However, a set of factors can help make the segmentation process more accurate [Nad11]:
• Case distinction: English sentences normally start with a capital letter (but so do proper nouns).
• POS tags: the tags surrounding a punctuation mark can assist this process, but multi-tag situations complicate it, such as the use of -ing verbs as nouns.
• The length of the word (in the case of abbreviation disambiguation; notice that a period may mark the end of a sentence and an abbreviation at the same time).
• Morphological information; this requires finding the stem of a word by suffix removal.

It is preferable not to separate the tokenization and segmentation processes; they are usually merged to solve most of the above problems, specifically the segmentation problems. A sentence is described as an indeterminate unit because of the difficulty of deciding where one ends and another starts, while grammar is indeterminate from the standpoint of deciding "which sentence is grammatically correct?", because this question can be answered divisively, and discourse segmentation difficulty is not the only reason but
also grammatical acceptability, meaning, goodness or badness of style, lexical acceptability, context acceptability, etc. [Qui85]

2.6.2 Ambiguity

An input is ambiguous if there is more than one alternative linguistic structure for it. [Jur00] There are two major types of sentence ambiguity: genuine ambiguity and computer ambiguity. In the first, the sentence really has two different meanings to an intelligent hearer; in the second, the sentence has one meaning but the computer finds more than one, and this type, unlike the first, is a real problem facing NLP applications. [Not] Ambiguity as an NLP problem is found in every processing step [Not] [Bha12]:

2.6.2.1 Lexical Ambiguity

Lexical ambiguity is the possibility for a word to have more than one meaning or more than one POS tag. Obviously, meaning ambiguity leads to semantic ambiguity, and tag ambiguity leads to syntactic ambiguity because it can produce more than one parse tree. Frequency is one available solution for this problem.

2.6.2.2 Syntactic Ambiguity

The sentence has more than one syntactic structure; in particular, common sources of ambiguity in English are:
• Phrase attachment: how a certain phrase or clause in the sentence can be attached to another when there is more than one possibility. Crossing is not allowed in parse trees; therefore, a parser generates a parse tree for each accepted state.
• Conjunction: sometimes the parser is befuddled in selecting which phrase a conjunction should be connected to.
• Noun group structure: the rule NG → NG NG allows English to string long series of nouns together.

Some of these problems can be resolved by applying syntactic constraints.

2.6.2.3 Semantic Ambiguity

Even when a sentence is lexically and syntactically unambiguous, there is sometimes more than one interpretation for it, because a phrase or a word may refer to more than one meaning. "Selection restrictions" or "semantic constraints" are one way to disambiguate such sentences: two concepts are combined in one mode only if both concepts, or one of them, have specific features. Frequency in context can also help in deciding the meaning of a word.

2.6.2.4 Anaphoric Ambiguity

This is the possibility for a word or a phrase to refer to something previously mentioned when there is more than one possible referent. This type of ambiguity can be resolved by parallel structures or recency rules.

2.6.3 Language Change

"All living languages change with time; it is fortunate that they do so rather slowly compared to the human life span." Language change is represented by the change in the grammars of the people who speak the language, and it has been shown that English has changed in its lexicon and in the phonological,
morphological, syntactic, and semantic components of its grammar over the past 1,500 years. [Fro07]

2.6.3.1 Phonological Change

Regular sound correspondences show how the phonological system changes. The phonological system, like any other linguistic system, is governed by a set of rules, and this set of phonemes and phonological rules is subject to change through the modification, deletion and addition of rules. Changes in phonological rules can affect the lexicon, since some English word formations depend on sounds; for example, vowel sounds differentiate nouns from verbs (the nouns house and bath from the verbs house and bathe).

2.6.3.2 Morphological Change

Morphological rules, like phonological ones, are subject to addition, loss and change. The usage of suffixes is mostly the active area of change, where the way they are added to the ends of stems affects the resulting words and therefore changes the lexicon.

2.6.3.3 Syntactic Change

Syntactic changes are influenced by morphological changes, which in turn are influenced by phonological changes. This type of change includes all kinds of grammar modifications that are mainly based on the reordering of words inside the sentence.

2.6.3.4 Lexical Change

Change of lexical category is the most common change of this type. Examples are the usage of nouns as verbs, verbs as nouns, and adjectives as nouns. Lexical change also includes the
addition of new words, the borrowing of loan words from other languages, and the loss of existing words.

Figure (2.5): An example of lexical change (source: Darby Conley / Get Fuzzy © UFS, Inc., 24 Feb. 2012)

2.6.3.5 Semantic Change

Just as the category of a word can change, its semantic representation or meaning can change, too. Three types of change are possible for a word:
• Broadening: the meaning of a word is expanded so that it means everything it used to mean and more.
• Narrowing: the reverse of broadening; the word's meaning is reduced from a more general meaning to a specific one.
• Shifting: the word's reference shifts to another meaning somewhat different from the original one.
Part II
Text Correction

2.7 Introduction

Text correction is the process of indicating incorrect words in an input text, finding candidates (or alternatives), and suggesting those candidates as corrections for the incorrect word. The term incorrect refers to two different types of erroneous words: misspelled and misused. The process is mainly divided into two distinct phases: an error detection phase, which indicates the incorrect words, and an error correction phase, which combines both generating and suggesting candidates.

Devising techniques and algorithms for correcting texts automatically has been an open research challenge from the early 1960s until now, because the existing correction techniques are limited in their accuracy and application scope [Kuk92]. Usually, a correction application targets a specific type of error, because it is a complex task to computationally predict the word a human writer intended.

2.8 Text Errors

A word can be mistaken in two ways. The first is by incorrectly spelling the word, whether through a lack of information about its spelling or by mistyping symbol(s) within it; this type of error is known as a non-word error, where the word cannot be found in the language lexicon. The second is by using a correctly spelled word in the wrong position in a sentence or in an unsuitable context. These errors are known as real-word errors
where the incorrect word is accepted by the language lexicon. [Gol96][Amb08]

Non-word errors are easier to detect; real-word errors need more information about the syntax and semantics of the language. Accordingly, correction techniques are divided into isolated-word error correction, which is concerned with non-word errors, and context-sensitive error correction, which deals with real-word errors. [Gol96]

2.8.1 Non-word Errors

These errors include words that are not found in the lexicon; a misspelled word contains one or more of the following errors:
• Substitution: one or more symbols are changed.
• Deletion: one or more symbols are missing from the intended word.
• Insertion: symbol(s) are added at the front, the end, or any position in the word.
• Transposition: two adjacent symbols are swapped.

These four errors are known as the Damerau edit operations.

2.8.2 Real-word Errors

These errors occur when an intended word is mistaken for another one that is lexically accepted. Real-word errors can result from phonetic confusion, such as using the word "piece" instead of "peace", which usually leads to semantically unaccepted sentences (possibly after applying non-word correction), or even from misspelling the intended word and producing another lexically accepted word. [Amb08] Sometimes the confusion results in syntactically unaccepted sentences, such as writing "John visit his uncle" instead of "John visits his uncle".
Correcting real-word errors is context sensitive in that it needs to check the surrounding words and sentences before suggesting candidates.

2.9 Error Detection Techniques

How a word is flagged as correct or incorrect depends on the type of correction procedure. Non-word error detection usually checks the acceptance of a word in the language dictionary (the lexicon) and marks any unmatched word as incorrect, while real-word error detection is a more complex task that requires analysing larger parts of the text, typically paragraphs or the full text [Kuk92]. In this work, we mainly focus on non-word error detection techniques. Boswell defines a spelling error as a query word Q that is not an entry in the dictionary D at hand [Bos05], and outlines an algorithm for spelling correction as shown in figure (2.6). Spelling error detection techniques can be classified into two major types:

2.9.1 Dictionary Looking Up

All the words of a given text are matched against every word in a pre-created dictionary, or a list of all acceptable words in the language under consideration (or most of them, since some languages have a huge number of words and collecting them all is a nearly impossible task). A word is incorrect if and only if no match is found. This technique is robust but suffers from the long time required for checking; as the dictionary grows larger, the looking-up time becomes longer. [Kuk92] [Mis13]

2.9.1.1 Dictionary Resources

Many systems deal with collecting and updating the lexical dictionaries of languages. One example is the WordNet online application, a large database of English lexicons. Lexical entries (nouns,
verbs, adjectives, adverbs, etc.) are interlinked by lexical and conceptual-semantic relations. The structure of WordNet is a network of meaningfully related words and concepts, which makes it a good tool for NLP and computational linguistics. Another example is the ISPELL text corrector, an online spell checker that provides interfaces for many western languages. ISPELL is the latest version of R. Gorin's spell checker originally developed for Unix. Its correction suggestions are based on a single Levenshtein edit distance and rely on looking up every token of the input text in a huge lexical dictionary. [ISP14]

2.9.1.2 Dictionary Structures

The standard looking-up technique is to match every token of the text against every token in the dictionary, but this process requires a long time, because natural-language dictionaries are usually huge and string matching takes longer than comparisons on other data types. A solution to this challenge is to reduce the search space in a way that keeps similar tokens grouped together.

Figure (2.6): Outline of a Spell Correction Algorithm [Bos05]
Algorithm: Spell_correction
Input: word w
Output: suggestion(s), a set of alternatives for w
Begin
  If (is_mistake(w))
  Begin
    Candidates = get_candidates(w)
    Suggestions = filter_candidates(Candidates)
    Return Suggestions
  End
  Else
    Return is_correct
End.
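As a concrete illustration, a minimal Python sketch of this outline might look as follows; the helper functions are placeholders for the detection, generation and filtering steps described in the following sections, and their exact behaviour here is an illustrative assumption rather than the thesis's implementation.

    def spell_correction(word, dictionary):
        """Minimal sketch of the spell-correction outline in figure (2.6).
        `dictionary` is assumed to be a set of accepted tokens."""
        if is_mistake(word, dictionary):
            candidates = get_candidates(word, dictionary)
            suggestions = filter_candidates(word, candidates)
            return suggestions          # ranked alternatives for the word
        return None                     # the word is already correct

    def is_mistake(word, dictionary):
        # Non-word detection: the word is a mistake if it is not in the lexicon.
        return word.lower() not in dictionary

    def get_candidates(word, dictionary):
        # Placeholder: in practice, candidates are dictionary tokens within a
        # small edit distance of `word` (see section 2.10.1).
        return [t for t in dictionary if abs(len(t) - len(word)) <= 2]

    def filter_candidates(word, candidates):
        # Placeholder ranking: keep candidates sharing the first letter.
        return [c for c in candidates if c[:1] == word[:1]]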
Grouping according to spelling or phonetics [Mis13] and using hash tables are two fundamental ways to minimize the search space. Hashing techniques apply a hash function to generate a numeric key from a string. The numeric keys are references to packets of tokens that generate the same key index; hash functions differ in their ability to distribute tokens and in how much they minimize the search space. A perfect hash function generates no collisions (a collision being the hashing of two different tokens to the same key index), and a uniform hash function distributes tokens among packets uniformly. The optimal hash function is a uniform perfect hash function, which hashes exactly one token to every packet; such a situation is impossible with dictionaries due to the variance of tokens. [Nie09] Spelling- and phonetics-dependent groupings use a limited set of packets and generate keys according to spelling or pronunciation; they are another style of hashing, and sometimes of clustering. SPEEDCOP and Soundex are examples. [Mis13] [Kuk92]

2.9.2 N-gram Analysis

N-grams are defined as subsequences of words or strings of length n, where n is variable: one produces unigrams (or monograms), two produces bigrams (sometimes called "digrams"), three produces trigrams, and larger values are used more rarely. This technique detects errors by examining each n-gram of a given string and looking it up in a precompiled table of n-gram statistics. The decision depends on the existence of the n-gram or its frequency of occurrence: if the n-gram is not found or is highly infrequent, then the words or strings which contain it are considered incorrect. [Kuk92] [Mis13]
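A minimal sketch of this idea in Python is given below, assuming a plain set of letter bigrams collected from a dictionary (the variable names and the toy lexicon are illustrative only; a statistical system would use frequencies rather than a simple presence test).

    def letter_bigrams(token):
        """Return the set of adjacent letter pairs in a token."""
        return {token[i:i + 2] for i in range(len(token) - 1)}

    def build_bigram_table(dictionary):
        """Collect every letter bigram that occurs in the dictionary tokens."""
        table = set()
        for token in dictionary:
            table |= letter_bigrams(token)
        return table

    def looks_misspelled(word, bigram_table):
        """Flag a word if it contains a bigram never seen in the dictionary."""
        return any(bg not in bigram_table for bg in letter_bigrams(word.lower()))

    # Example usage with a toy lexicon:
    lexicon = {"peace", "piece", "visit", "uncle"}
    table = build_bigram_table(lexicon)
    print(looks_misspelled("pqiece", table))   # True  ("pq" is unseen)
    print(looks_misspelled("piece", table))    # False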
2.10 Error Correction Techniques

Many techniques have been proposed to solve the problem of generating candidates for a detected misspelled word; they vary in the required resources, application scope, time and space complexity, and accuracy. The most common are [Kuk92] [Mis13]:

2.10.1 Minimum Edit Distance Techniques

This technique rests on counting the minimum number of primitive operations required to convert a source string into a target one. Some researchers take the primitive operations to be the insertion, deletion, and substitution of a single letter; others add the transposition of two adjacent letters as a fourth primitive operation. Examples are the Levenshtein algorithm, which counts a distance of one for every primitive operation; the Hamming algorithm, which works like Levenshtein but is limited to strings of equal length; and the Longest Common Substring, which finds the mutual substring of two words.

Levenshtein, shown in figure (2.7) [Hal11], is preferred because it places no limitation on the types of symbols or on the lengths of the strings. It can be executed with a time complexity of O(M·N), where M and N are the lengths of the two input strings. The algorithm can detect three types of errors (substitution, deletion, and insertion). It does not count the transposition of two adjacent symbols as one edit operation; instead, it counts such an error as two consecutive substitutions, giving an edit distance of 2.
Figure (2.7): Levenshtein Edit Distance Algorithm [Hal11]
Algorithm: Levenshtein Edit Distance
Input: String1, String2
Output: number of edit operations
Step 1: Declaration
  distance(length of String1, length of String2) = 0, min1 = 0, min2 = 0, min3 = 0, cost = 0
Step 2: Calculate distance
  if String1 is NULL return length of String2
  if String2 is NULL return length of String1
  for each symbol x in String1 do
    for each symbol y in String2 do
    begin
      if x = y then cost = 0 else cost = 1
      r = index of x, c = index of y
      min1 = distance(r - 1, c) + 1           // deletion
      min2 = distance(r, c - 1) + 1           // insertion
      min3 = distance(r - 1, c - 1) + cost    // substitution
      distance(r, c) = minimum(min1, min2, min3)
    end
Step 3: Return the value of the last cell in the distance matrix
  return distance(length of String1, length of String2)
End.
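A runnable Python sketch of the algorithm in figure (2.7) might look as follows; the initialization of the first row and column of the distance matrix, which the figure leaves implicit, is written out explicitly here.

    def levenshtein(s1: str, s2: str) -> int:
        """Minimum number of insertions, deletions and substitutions
        needed to turn s1 into s2 (classic O(M*N) dynamic programming)."""
        m, n = len(s1), len(s2)
        if m == 0:
            return n
        if n == 0:
            return m
        # distance[r][c] = edit distance between s1[:r] and s2[:c]
        distance = [[0] * (n + 1) for _ in range(m + 1)]
        for r in range(m + 1):
            distance[r][0] = r          # r deletions
        for c in range(n + 1):
            distance[0][c] = c          # c insertions
        for r in range(1, m + 1):
            for c in range(1, n + 1):
                cost = 0 if s1[r - 1] == s2[c - 1] else 1
                distance[r][c] = min(
                    distance[r - 1][c] + 1,         # deletion
                    distance[r][c - 1] + 1,         # insertion
                    distance[r - 1][c - 1] + cost,  # substitution
                )
        return distance[m][n]

    print(levenshtein("kitten", "sitting"))   # 3
    print(levenshtein("peace", "piece"))      # 2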
One of the well-known modifications of the original Levenshtein method incorporates the work of Fred Damerau, who found that roughly 80% to 90% of spelling errors are instances of the four error types taken together; the combined measure is known as the Damerau-Levenshtein distance. [Dam64] The modified method requires a longer execution time than the original: in every cell computation it applies an additional comparison to check whether a transposition took place in the string, and then another comparison to select the minimum between the previously computed distance and the distance obtained when a transposition is assumed. This extra step roughly doubles the work per cell, giving a running time on the order of 2·M·N operations (the asymptotic complexity remains O(M·N), but with a larger constant factor). Hence, in this work the original Levenshtein method (figure (2.7)) is modified to consider Damerau's four error types with a time complexity shorter than that of the Damerau-Levenshtein algorithm and close to the original method. Figure (2.8) shows Damerau's modification of the Levenshtein method.

Figure (2.8): Damerau-Levenshtein Edit Distance Algorithm [Dam64]
Algorithm: Damerau-Levenshtein Distance
Input: String1, String2
Output: number of Damerau edit operations
Step 1: Declaration
  distance(length of String1, length of String2) = 0, min1 = 0, min2 = 0, min3 = 0, cost = 0
Step 2: Calculate distance
  if String1 is NULL return length of String2
  if String2 is NULL return length of String1
  for each symbol x in String1 do
    for each symbol y in String2 do
    begin
      if x = y then cost = 0 else cost = 1
      r = index of x, c = index of y
      min1 = distance(r - 1, c) + 1           // deletion
      min2 = distance(r, c - 1) + 1           // insertion
      min3 = distance(r - 1, c - 1) + cost    // substitution
      distance(r, c) = minimum(min1, min2, min3)
      if not (String1 starts with x) and not (String2 starts with y) then
        if (the symbol preceding x = y) and (the symbol preceding y = x) then
          distance(r, c) = minimum(distance(r, c), distance(r - 2, c - 2) + cost)
    end
Step 3: Return the value of the last cell in the distance matrix
  return distance(length of String1, length of String2)
End.
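A runnable Python sketch of the distance in figure (2.8), the restricted (optimal string alignment) variant that treats a swap of two adjacent symbols as a single operation, might look like this:

    def damerau_levenshtein(s1: str, s2: str) -> int:
        """Edit distance counting insertion, deletion, substitution and the
        transposition of two adjacent symbols as single operations
        (restricted variant, as in figure (2.8))."""
        m, n = len(s1), len(s2)
        distance = [[0] * (n + 1) for _ in range(m + 1)]
        for r in range(m + 1):
            distance[r][0] = r
        for c in range(n + 1):
            distance[0][c] = c
        for r in range(1, m + 1):
            for c in range(1, n + 1):
                cost = 0 if s1[r - 1] == s2[c - 1] else 1
                distance[r][c] = min(
                    distance[r - 1][c] + 1,         # deletion
                    distance[r][c - 1] + 1,         # insertion
                    distance[r - 1][c - 1] + cost,  # substitution
                )
                # Extra check: adjacent transposition (the Damerau operation).
                if (r > 1 and c > 1
                        and s1[r - 1] == s2[c - 2]
                        and s1[r - 2] == s2[c - 1]):
                    distance[r][c] = min(distance[r][c],
                                         distance[r - 2][c - 2] + cost)
        return distance[m][n]

    print(damerau_levenshtein("form", "from"))    # 1 (a single adjacent swap)
    print(damerau_levenshtein("peace", "piece"))  # 2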
2.10.2 Similarity Key Techniques

As its name indicates, this technique computes a key that groups similarly spelled words together. The similarity key is computed for the misspelled word and mapped to a pointer that refers to the group of words whose spelling is similar to the input. The Soundex algorithm derives keys from the pronunciation of words, while the SPEEDCOP system rearranges the letters of a word by placing the first letter, followed by the consonants, and finally the vowels, according to their order of occurrence in the word and without duplication. [Kuk92] [Mis13]

2.10.3 Rule Based Techniques

This approach applies a set of rules to the misspelled word, based on common mistake patterns, to transform it into a valid word. After all applicable rules have been applied, the generated words that are valid in the dictionary are suggested as candidates.

2.10.4 Probabilistic Techniques

Two methods are mainly based on statistics and probability:
1) Transition method: depends on the probability of a given letter being followed by another. The probability is estimated from n-gram statistics over a large corpus.
2) Confusion method: depends on the probability of a given letter being confused with, or mistaken for, another. Probabilities in this method are source dependent; for example, Optical Character Recognition (OCR) systems vary in their accuracy and in how they recognize letters, and Speech Recognition (SR) systems usually confuse sounds.
2.11 Suggestion of Corrections

Suggesting corrections may be merged with candidate generation; it is fully dependent on the output of the generation phase. The user is usually provided with a set of corrections and can then choose among them, keep the written word unchanged, add the token to the dictionary, or rewrite the word when the desired word is not in the corrections list. Suggestions are listed in non-increasing order of their similarity and suitability for replacing the source word. Similarity depends on the method used to compute the distance or similarity between each candidate and the source token, while suitability depends on the surrounding words within the sentence or paragraph (in context-sensitive correction, the full text may be examined before making a suggestion).

2.12 The Suggested Approach

The primary goal of this work is to find the nearest alternative word among all the candidates available in the underlying dictionary. When a non-word is encountered there are many candidates available to replace it, but the trick is: which one of those alternatives was intended by the writer? The suggested work answers this question as follows.

All the dictionary tokens, whose count may reach several hundred thousand, could have been intended by the writer, or none of them might have been. The writer (or typist) may really have misspelled the word, or may have written it perfectly while the word is simply not in the dictionary, i.e. never seen before, and is therefore an "unknown" token. The problem of deciding whether a word is misspelled or unknown cannot be solved in general. For this reason, the suggested system assumes every
unrecognized word is misspelled and lets the user make the final decision. As an initial solution, all the tokens in the dictionary are candidates, and further processing must reduce their number.

2.12.1 Find Candidates Using Minimum Edit Distance

The starting step is to look for the most similar tokens in the lexicon dictionary and rank them according to their minimum edit distance from the misspelled word. This action reduces the number of candidates to an acceptable amount, using either a threshold on the number of edit operations needed to equate a candidate with the misspelled word or a maximum limit on the number of candidates. The suggested system uses the Levenshtein method after enhancing it to consider the four Damerau edit operations.

To find the similar tokens, the lexicon must be looked up and every token in it examined against the given word. This process is time consuming because of the huge number of tokens held by the lexicon dictionary and the time required by the edit-distance algorithm itself. Hence, the search space needs to shrink; a method is proposed to group similar tokens into semi-clusters using spelling properties.

2.12.2 Candidates Mining

The best set of candidates goes through another processing step to specify how the generated candidates are related to the misspelled token and, accordingly, how they should be ranked. The process is implemented using a vector of the following features:
• Named-entity recognition: many issues are considered.
• Transposition probability: keyboard proximity and physical similarity.
• Confusion probability: because phonetic errors are common, this analysis helps to find whether a word was misspelled by replacing letter(s) with others of the same sound.
• Matching of the starting and ending letters.
• The effect of candidate length.

A weighting scheme is applied to give each feature a role in deciding the best set of suggestions; however, the similarity score carries the largest weight among them (a simplified sketch of this generation-and-ranking step is given below).

2.12.3 Part Of Speech Tagging and Parsing

Finally, the suitable candidate is chosen by the parser. The parser selects the candidate(s) that make(s) the sentence containing the misspelled word correct. Tagging plays an important role in specifying the optimal candidate, because filtering by POS tag is the base on which the parser stands to select a candidate for its incomplete sentence. The selected tag affects not only the candidate but every token in the sentence; this is the nature of English (and of most natural languages). At this step, the set of candidates should contain the minimum number of elements, but the best ones.

Grammar checking, accomplished by parsing, is another goal of this system. The system applies a sentence-phrasing process and checks the consistency of each phrase according to English grammar rules. When an incorrect structure is encountered, the system tries to re-correct it. Parsing is a fundamental step in specifying the correct choice of candidates, since the basic goal is to produce a correct sentence. The underlying dictionary is an integration of the WordNet dictionary with the ISPELL dictionary.
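A simplified Python sketch of the candidate-generation and ranking steps of sections 2.12.1 and 2.12.2 is given below. The feature set, the feature weights and the passed-in distance function are illustrative assumptions, not the exact values or features used by the system; the sketch only shows the general pattern of thresholded generation followed by weighted re-ranking.

    def generate_candidates(word, dictionary, distance_fn, max_distance=2):
        """Keep only dictionary tokens within `max_distance` edit operations
        of `word`, ordered by that distance (cf. section 2.12.1)."""
        scored = []
        for token in dictionary:
            d = distance_fn(word, token)
            if d <= max_distance:
                scored.append((d, token))
        return sorted(scored)                      # smallest distance first

    def rank_candidates(word, scored_candidates, weights=None):
        """Re-rank candidates with a weighted feature vector (cf. section 2.12.2).
        The similarity (edit-distance) term is given the largest weight."""
        if weights is None:
            weights = {'similarity': 0.6, 'first_last': 0.25, 'length': 0.15}
        ranked = []
        for distance, cand in scored_candidates:
            similarity = 1.0 / (1.0 + distance)
            first_last = ((cand[:1] == word[:1]) + (cand[-1:] == word[-1:])) / 2.0
            length = 1.0 - abs(len(cand) - len(word)) / max(len(cand), len(word))
            score = (weights['similarity'] * similarity
                     + weights['first_last'] * first_last
                     + weights['length'] * length)
            ranked.append((score, cand))
        return [c for _, c in sorted(ranked, reverse=True)]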
Figure 2.9 shows the block diagram of the suggested work; further chapters give more details for each block.

Figure (2.9): The suggested system block diagram. Its main blocks are: preprocessing; the WordNet lexical dictionary; morphological analysis and POS-tag expansion; the ISPELL datasets; dictionary integration; hashing and indexing; POS tagging; the integrated hashed-indexed dictionary; the token stream; the stream of sentences with tagged tokens; candidate generation; phrasing; candidate ranking; grammar correction; and sentence recovery with suggestion listing. (The diagram in figure 2.9 is detailed further through the next three chapters.)
Chapter Three
Dictionary Structure and Looking Up Technique
3.1 Introduction

The dictionary is a basic unit of almost every NLP application. It holds the lexicon of the language under processing together with related information chosen according to the purpose of the application, such as POS tags, semantic information, phonetics, pronunciation and others. Typically, dictionaries are data structures organized as a list or collection of tokens or words, where each word (or token) is associated with the information that makes its use by an NLP application possible.

The number of tokens held by a dictionary is a critical point in NLP applications, especially taggers and text correction systems: as the number of tokens becomes smaller, the ratio of detected errors also becomes smaller, since a poor dictionary allows erroneous words to pass undetected. On the other hand, a large dictionary increases this ratio but requires a longer time for looking up tokens. Therefore, a balance is needed to keep the dictionary as inclusive as possible while keeping the looking-up speed fast. Many approaches have been proposed to handle this problem, among them indexing and hash functions.

3.2 Hashing

The ideal feature of any dictionary is the availability of random access, but strings are a highly variable data type, which makes this impossible, at least from the standpoint of memory constraints.
Hashing is the process of converting a string S into an integer within the range [0, M-1], where M is the number of available addresses in a predefined table. Hash functions promise random access, but not on their own: the variance of language tokens would require an infinite hash table to hold every token "separately", and a variable-size addressing buffer that may be unloadable by most current systems, besides wasting a great deal of storage space. By "separately" we mean that no two strings have the same hash value, i.e. no collisions; as the number of collisions grows, the looking up inside packets takes longer. However, a hash function can be exploited as a partial solution combined with other approaches: while the hash function maps tokens, according to some of their features, into packets of manageable size, approaches such as indexing and advanced search techniques enhance the looking-up speed to a reasonable level.

3.2.1 Hash Function

The hash function in this work was designed to exploit the spelling of tokens as the addressing key: it converts the prefix of a token into the address of the packet in which the token is grouped. The English alphabet, the language considered in this work, contains the uppercase letters 'A' to 'Z', the lowercase letters 'a' to 'z', and the digits 0 to 9, in addition to some special-purpose characters which cannot be avoided in the dictionary because they are parts of some tokens, such as the slash (/), period (.), apostrophe ('), underscore (_), whitespace, and hyphen (-). The resulting character set contains 68 characters, which can be reduced further by replacing the codes of the digits 1 to 9 with
the code of 0, because distinguishing between numbers has no importance in this application, for two reasons:
• The difference between numbers is not a problem in the correction process, since no system can estimate what number was intended by the writer; therefore any written number is accepted as it stands.
• If a distinction were made when treating numbers, we would need to cover every possible number in the dictionary, resulting in an infinite dictionary size, because numbers are infinite.

The final alphabet contains the union of the above-mentioned sets and the reduced number set: ∑ = { A, B, …, Z, a, b, …, z, 0, /, . , ' , - , _ , whitespace }, which can be re-encoded using only 6 bits, as shown in Table 3.1 (unused codes are marked with *).

Hashing according to prefixes is a good way to minimize packet sizes. It is similar to the SOUNDEX and SPEEDCOP methods [Mis13][Kuk92] in that they share the same goal, minimizing the size of the search space, but it differs from them in that this approach maps tokens to predefined packet addresses using a limited-length prefix of the string, while those methods use the whole length and filter the letters according to sound or spelling. This difference gives the suggested approach two interesting features:
1. The hash function is simple and can be applied directly without any preprocessing; SOUNDEX needs to encode letters into their phonetic groups, and SPEEDCOP rearranges letters.
Table 3.1: Alphabet Encoding
  'A' to 'Z'          -> codes 0 to 25
  'a' to 'z'          -> codes 26 to 51
  ' (apostrophe)      -> code 52
  / (slash)           -> code 53
  - (hyphen)          -> code 54
  _ (underscore)      -> code 55
  . (period)          -> code 56
  0 (all digits)      -> code 57
  whitespace          -> code 58
  codes 59 to 63      -> unused (*)
2. Random access is established by using the output of the hash function directly as an address, while both previous methods need to search for a match between the computed value and the stored codes.

3.2.2 Formulation

As mentioned above, the size of the alphabet is reduced to only 59 symbols, which can be encoded using only 6 bits instead of the standard 8 bits, making a series of hash functions available over prefixes of 1, 2, or any longer sequence of symbols. The prefix length is itself a trade-off: if the prefix is too short, the number of packets is small and each packet holds a large number of tokens, resulting in a longer looking-up time; on the other hand, long prefixes create a large number of packets, many of which are sparse because of the variance and irregularity of tokens that characterize natural languages.

The function uses a three-character prefix C1C2C3, converts it into integers as presented in Table (3.1), and then computes the hash value H according to Equation (3.1), which concatenates the three 6-bit codes:

  H(C1, C2, C3) = code(C1) x 2^12 + code(C2) x 2^6 + code(C3)        (3.1)

H represents the address of the packet where tokens starting with the same prefix are held. Obviously, the number of available packet addresses is equal to the number obtained from concatenating the three symbols' binary codes, as shown in Table (3.2), where the symbol at index 0 is 'A' and the symbol at index 63 (the last available index in the alphabet) is the unused cell referred to by '*'.
Start address = (C1)_2 || (C2)_2 || (C3)_2 = (000000000000000000)_2 = (0)_10
End address  = (C1)_2 || (C2)_2 || (C3)_2 = (111111111111111111)_2 = (262143)_10

This makes the total number of packets 2^18 = 262,144. Some of these packets are empty because their addresses do not match any actual token prefix in the lexicon, but the distribution of tokens among packets reduces the search space to a manageable size, especially when the hash function is combined with an indexing scheme to build the dictionary as a two-level structure.

Table 3.2: Addressing Range

            Starting address encoding              End address encoding
        Alphabetic   Decimal   Binary          Alphabetic   Decimal   Binary
  C1        A           0      000000              *           63     111111
  C2        A           0      000000              *           63     111111
  C3        A           0      000000              *           63     111111
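A minimal Python sketch of this prefix-hashing scheme might look as follows. The encoding table mirrors Table 3.1; the padding of tokens shorter than three characters with whitespace is an assumption not specified in the text, and the helper names are illustrative.

    # 6-bit codes for the reduced 59-symbol alphabet of Table 3.1.
    ALPHABET = (
        [chr(c) for c in range(ord('A'), ord('Z') + 1)] +   # codes 0-25
        [chr(c) for c in range(ord('a'), ord('z') + 1)] +   # codes 26-51
        ["'", '/', '-', '_', '.', '0', ' ']                  # codes 52-58
    )
    CODE = {symbol: i for i, symbol in enumerate(ALPHABET)}

    def symbol_code(ch: str) -> int:
        """Map a character to its 6-bit code; all digits share the code of '0'."""
        if ch.isdigit():
            ch = '0'
        return CODE[ch]

    def prefix_hash(token: str) -> int:
        """Packet address for a token: concatenation of the 6-bit codes of its
        first three symbols (equation (3.1)), giving a value in 0..2**18 - 1."""
        padded = (token + '   ')[:3]            # pad short tokens with whitespace
        c1, c2, c3 = (symbol_code(ch) for ch in padded)
        return (c1 << 12) | (c2 << 6) | c3

    print(prefix_hash('ABC'))        # 0*4096 + 1*64 + 2 = 66
    print(prefix_hash('abc'))        # 26*4096 + 27*64 + 28 = 108252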
3.2.3 Indexing

Key-indexing is an in-memory lookup technique based strictly on direct addressing into an array, with no comparisons between keys. Its area of applicability is limited to numeric keys falling in a limited range defined by the available memory resources. Hashing helps direct addressing work on keys of any type and range by bringing serial search and collision-resolution policies into the equation.

Indexing is exploited here to create a reference table that holds the 2^18 packet head addresses, which can be addressed directly by the hash function. Every record in the reference table contains two fields: the first is the "base" field, which holds an address if its index matches a token prefix and the value (-1) otherwise; the second is the "limit" field, which holds the length of the primary packet related to that index. Looking up the packet that contains tokens starting with a specific prefix is outlined in figure (3.1).

Figure (3.1): Token Hashing Algorithm
Algorithm: Token Hashing
Input: English token (a finite string over ∑), the reference and hash tables.
Output: head address of the packet where the input token may reside.
Step 1: set variables C1, C2, and C3 to the first three symbols of the input token.
Step 2: compute Index from C1, C2, and C3:
  Index = code(C1) x 2^12 + code(C2) x 2^6 + code(C3)
Step 3: go to the reference table record indexed by Index.
Step 4: examine the Base field:
  if Base > -1 return (Base value)
  else return fail
End.

The packets referred to by the reference table are treated as primary packets, which hold tokens with identical 3-symbol prefixes. To reduce the search space further, sub-packets can be created for every primary packet. The second level of token distribution is also based on prefixes, but with longer sequences: instead of using only three symbols to group tokens with identical prefixes, the prefix equality is expanded to 6 symbols by subdividing the tokens inside primary packets into secondary packets
which consist of a head and a set of tokens that are identical to the head in their first 6 symbols. The structure of the dictionary can be clarified by hashing the exemplar token ABCDEFGH according to the approach described previously.

Figure (3.2): Dictionary Structure and Indexing Scheme. For the prefix C1=A, C2=B, C3=C, the hash value H(C1,C2,C3) selects the reference-table record holding the primary packet's head address X and length Y. The primary packet (head code "ABC") holds the tokens ABCS0$, ABCS1$, …, ABCSY-1$, and the secondary packet for Si = "DEF" holds the tokens ABCDEFT0, ABCDEFT1, …, ABCDEFTR-1, which share the 6-symbol prefix ABCDEF. (The dollar sign ($) refers to any sequence that may follow Si.)
An interesting characteristic of secondary packets is that no extra space is wasted, because they are not based on a predefined packet structure. The secondary head, which is a token within a primary packet, may be followed by tokens sharing the same 6-symbol prefix, which are collected in one variable-size secondary packet; or it may not be followed by any, in which case no secondary packet is needed.

3.3 Looking Up Procedure

As shown in figure (3.2), the process of looking for a target token starts once the primary packet head address is obtained from the reference table, which in turn is computed using the hash function. In the hash table, where the tokens are stored according to their indices, the search process begins with a random access via the index of the primary packet head, and the matching then proceeds sequentially. The matching is performed on the fourth through sixth symbols of every token related to that primary packet; this reduces comparison time, since matching the whole sequence would take longer. Even though the reduction is small, it is useful in such cases, because logical operations on strings are more expensive than on other data types. When a full match on this segment is found, the target token is compared completely with the token at that record: if they match, the goal is reached; otherwise, searching continues in the secondary packet related to that token (if one exists). The comparison inside secondary packets, unlike primary packets, uses the full token length, and failure here implies that there is no chance of finding the target token in the dictionary. The algorithm in figure (3.3) outlines the looking-up procedure after the primary head address has been obtained.
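A simplified Python sketch of this two-level lookup is given below, under the assumption that each primary packet is stored as an ordered list of (token, secondary_packet) pairs keyed by the packet address; the data layout and names are illustrative, not the thesis's exact implementation.

    def lookup(token, reference, packets, prefix_hash):
        """Two-level dictionary lookup sketch.

        reference   : dict mapping packet address -> (base, limit); base is -1
                      when no token with that 3-symbol prefix exists.
        packets     : dict mapping base address -> list of (head, secondary)
                      where `secondary` is a list of tokens sharing the head's
                      first 6 symbols (possibly empty).
        prefix_hash : the 3-symbol prefix hash of equation (3.1).
        """
        address = prefix_hash(token)
        base, limit = reference.get(address, (-1, 0))
        if base == -1:
            return False                      # no packet for this prefix
        for head, secondary in packets[base][:limit]:
            # First compare only symbols 4..6, then the full token.
            if head[3:6] == token[3:6]:
                if head == token:
                    return True               # found in the primary packet
                if token in secondary:        # full-length match in sub-packet
                    return True
        return False                          # not in the dictionary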
