The document proposes an improved version of the Porter stemming algorithm to address some of its limitations. The new stemmer includes additional steps and rules to better handle irregular forms, verb conjugations, plurals, and other exceptions. An evaluation shows the new stemmer reduces understemming and overstemming errors compared to the original Porter stemmer and other algorithms, improving stemming accuracy and potentially information retrieval performance.
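The core mechanism described above, ordered suffix-stripping rules plus an exception table for irregular forms, can be sketched as follows. The rules and word lists here are illustrative placeholders, not the paper's actual rule set.

```python
# Minimal sketch of a rule-based stemmer in the spirit described above:
# an exception table handles irregular forms first, then ordered
# suffix-stripping rules are tried longest-first.

IRREGULAR = {"ran": "run", "geese": "goose", "went": "go"}  # hypothetical entries

RULES = [  # (suffix, replacement)
    ("sses", "ss"),
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def stem(word):
    word = word.lower()
    if word in IRREGULAR:  # irregular forms bypass the suffix rules
        return IRREGULAR[word]
    for suffix, repl in RULES:
        # require a minimal stem length to limit overstemming
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + repl
    return word

print(stem("ponies"))  # pony
print(stem("geese"))   # goose
print(stem("played"))  # play
```

A real Porter-style stemmer applies several such rule phases in sequence, with measure conditions on the remaining stem; this sketch shows only the rule-plus-exception structure.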
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA (ijnlc)
Morphological segmentation is a fundamental task in language processing. Some languages, such as Arabic and Tigrinya, have words packed with very rich morphological information. Therefore, unpacking this information becomes a necessary task for many downstream natural language processing tasks. This paper presents the first morphological segmentation research for Tigrinya. We constructed a new morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields (CRF) and window-based long short-term memory (LSTM) neural networks were employed separately to develop our boundary detection models. We applied language-independent character and substring features for the CRF and character embeddings for the LSTM networks. Experiments were performed with four variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved a 94.67% F1 score using bidirectional LSTMs with a fixed-size window approach to morpheme boundary detection.
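The BIO formulation above can be made concrete with a small sketch: each character gets a tag (B = begins a morpheme, I = inside one), and a window-based model classifies each position from a fixed-size character window. The example word and segmentation are English placeholders, since the paper's data is Tigrinya.

```python
# Sketch of BIO morpheme boundary labeling and fixed-size window features.

def to_bio(morphemes):
    """Turn a list of morphemes into per-character B/I tags."""
    tags = []
    for m in morphemes:
        tags.extend(["B"] + ["I"] * (len(m) - 1))
    return tags

def windows(word, size=2, pad="#"):
    """Fixed-size character windows around each position, as fed to a
    window-based classifier (one window per character of the word)."""
    padded = pad * size + word + pad * size
    return [padded[i : i + 2 * size + 1] for i in range(len(word))]

word = "unhappily"  # hypothetical segmentation: un-happi-ly
print(to_bio(["un", "happi", "ly"]))  # ['B','I','B','I','I','I','I','B','I']
print(windows(word, size=2)[0])       # '##unh'
```

The four BIO variants mentioned in the abstract differ in which tag inventory is used (e.g. whether boundary-final or single-character morphemes get dedicated tags); this sketch shows only the basic B/I scheme.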
This document provides an overview of tools that can be used to create parsers. It discusses parser generators, which can generate code for a parser, and parser combinators, which are libraries used to build parsers. Popular parser generators for regular languages include JFlex and ANTLR, while tools like ANTLR, APG, and JavaCC can generate parsers for context-free languages. The document also provides basic introductions to key parser concepts like lexical analysis, parse trees, grammars, and language types.
This document presents two new approaches for aligning sentences in parallel English-Arabic corpora: mathematical regression (MR) and genetic algorithm (GA) classifiers. Feature vectors containing text features like length, punctuation score, and cognate score are extracted from sentence pairs and used to train the MR and GA models on manually aligned training data. The trained models are then tested on additional sentence pairs, achieving better results than a baseline length-based approach. The methods can be applied to any language pair by modifying the feature vector.
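The feature vector the summary describes can be sketched as below. The concrete feature definitions here (character-length ratio, punctuation agreement, shared-token cognate score) are illustrative stand-ins for the paper's own features, and the example uses Latin-script placeholders rather than real Arabic text.

```python
# Hedged sketch of a sentence-pair feature vector for alignment:
# length, punctuation score, and cognate score, as named in the abstract.

def features(src, tgt):
    len_ratio = len(src) / max(len(tgt), 1)  # character-length ratio
    punct = set(".,;:!?")
    p_src = sum(c in punct for c in src)
    p_tgt = sum(c in punct for c in tgt)
    # 1.0 when both sentences carry the same amount of punctuation
    punct_score = 1 - abs(p_src - p_tgt) / max(p_src + p_tgt, 1)
    # cognate score: tokens shared verbatim (numbers, names, acronyms)
    shared = set(src.lower().split()) & set(tgt.lower().split())
    cognate = len(shared) / max(len(src.split()), 1)
    return [len_ratio, punct_score, cognate]

vec = features("UNESCO was founded in 1945.", "UNESCO 1945 ...")
print(vec)  # [1.8, 0.5, 0.2]
```

Such vectors, extracted from manually aligned pairs, are what the MR and GA classifiers would be trained on; adapting the method to another language pair means swapping the feature definitions.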
Arabic morphology encapsulates many valuable features such as a word's root. Arabic roots are being utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming is faced with the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model. It captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with at least a 9.4% improvement over other stemmers.
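Smoothed PMI over word-context co-occurrence counts can be computed as below. The toy corpus and the smoothing choice (raising context counts to alpha = 0.75, a common context-distribution smoothing) are illustrative; the paper's own SPMI formulation may smooth differently.

```python
import math
from collections import Counter

# Toy word-context co-occurrence pairs (invented for illustration).
pairs = [("bank", "money"), ("bank", "river"), ("bank", "money"),
         ("tree", "river"), ("tree", "leaf")]

pair_c = Counter(pairs)
word_c = Counter(w for w, _ in pairs)
ctx_c = Counter(c for _, c in pairs)
total = sum(pair_c.values())

def spmi(w, c, alpha=0.75):
    """PMI with a smoothed context distribution:
    P_alpha(c) = count(c)^alpha / sum_c' count(c')^alpha."""
    ctx_total = sum(n ** alpha for n in ctx_c.values())
    p_wc = pair_c[(w, c)] / total
    if p_wc == 0:
        return float("-inf")
    p_w = word_c[w] / total
    p_c = (ctx_c[c] ** alpha) / ctx_total
    return math.log(p_wc / (p_w * p_c))

print(spmi("bank", "money"))  # positive: co-occur more often than chance
```

In a stemming-disambiguation setting, scores like these let the system prefer the candidate root whose typical contexts best match the word's observed context.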
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
LeadMine: A grammar and dictionary driven approach to chemical entity recogni... (NextMove Software)
LeadMine is a system that uses large grammars and dictionaries to recognize chemical entities in text, without an explicit tokenization step. It enhances recall through abbreviation detection and enhances precision by removing abbreviations of non-entities. When trained on additional data, LeadMine achieved 86.2% precision and 85.0% recall on a test set for the task of identifying all chemical entity mentions. It combines the capabilities of grammars to recognize regular entities with the broad coverage of dictionaries.
Finite-state morphological parsing uses finite-state transducers to parse words into their morphological components like stems and affixes. It requires a lexicon of stems and affixes, morphotactic rules describing valid morpheme combinations, and orthographic rules for spelling changes. The parser is built as a cascade of finite-state automata representing the lexicon, morphotactics and spelling rules. It maps surface word forms onto their underlying lexical representations including stems and morphological features. This allows morphological analysis of both regular and irregular forms.
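The morphotactics component described above can be sketched as a small automaton: states and transitions encode which morpheme sequences are valid. Real systems compose such automata with orthographic-rule transducers over surface strings; this toy example, with an invented English lexicon, checks morpheme sequences only.

```python
# Toy finite-state morphotactics: optional prefix, then a stem from the
# lexicon, then an optional suffix.

LEXICON = {
    "prefix": {"un", "re"},
    "stem": {"lock", "do", "play"},
    "suffix": {"ed", "ing", "s"},
}

# transitions: state -> list of (morpheme class, next state)
TRANSITIONS = {
    "start": [("prefix", "have_prefix"), ("stem", "have_stem")],
    "have_prefix": [("stem", "have_stem")],
    "have_stem": [("suffix", "end")],
}
ACCEPTING = {"have_stem", "end"}  # a bare stem is a valid word

def accepts(morphemes):
    state = "start"
    for m in morphemes:
        for cls, nxt in TRANSITIONS.get(state, []):
            if m in LEXICON[cls]:
                state = nxt
                break
        else:  # no transition consumed this morpheme
            return False
    return state in ACCEPTING

print(accepts(["un", "lock", "ed"]))  # True
print(accepts(["ed", "lock"]))        # False
```

A transducer version would additionally emit the lexical side (stem plus morphological features) while consuming the surface side, which is what makes the cascade a parser rather than just an acceptor.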
Natural Language Processing: Parts of speech tagging, its classes, and how to ... (Rajnish Raj)
Part of speech (POS) tagging is the process of assigning a part of speech tag like noun, verb, adjective to each word in a sentence. It involves determining the most likely tag sequence given the probabilities of tags occurring before or after other tags, and words occurring with certain tags. POS tagging is the first step in many NLP applications and helps determine the grammatical role of words. It involves calculating bigram and lexical probabilities from annotated corpora to find the tag sequence with the highest joint probability.
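The probability model described above scores a tag sequence as the product of tag-bigram probabilities P(tag_i | tag_{i-1}) and lexical probabilities P(word_i | tag_i). The numbers below are made up for the example, not estimated from a real corpus.

```python
# Toy bigram HMM-style scoring of candidate tag sequences.

BIGRAM = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.5, ("DT", "VB"): 0.05,
          ("NN", "VB"): 0.3, ("VB", "NN"): 0.2}
LEXICAL = {("the", "DT"): 0.7, ("dog", "NN"): 0.01, ("barks", "VB"): 0.02,
           ("dog", "VB"): 0.0001}

def score(words, tags):
    """Joint probability of a (word, tag) sequence under the bigram model.
    Unseen events get a small floor probability instead of zero."""
    p = 1.0
    prev = "<s>"
    for w, t in zip(words, tags):
        p *= BIGRAM.get((prev, t), 1e-6) * LEXICAL.get((w, t), 1e-6)
        prev = t
    return p

words = ["the", "dog", "barks"]
good = score(words, ["DT", "NN", "VB"])
bad = score(words, ["DT", "VB", "NN"])
print(good > bad)  # True: the likelier tag sequence wins
```

A full tagger searches over all tag sequences with the Viterbi algorithm rather than scoring candidates one by one; the scoring function is the same.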
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge... (Guy De Pauw)
This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers and a morphological analyzer to effectively tokenize Setswana text. The approach was tested on a short Setswana text and achieved 93.6% accuracy between the automatically and hand-tokenized texts. While mostly successful, some issues remained around longest matches that were not valid tokens or did not allow morphological analysis. Overall, the approach demonstrated that a precise tokenizer and morphological analyzer can largely resolve the challenges of Setswana's disjunctive writing system.
Implementation Of Syntax Parser For English Language Using Grammar Rules (IJERA Editor)
For many years we have been using Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. Syntactic parsing works with the syntactic structure of a sentence. 'Syntax' refers to the grammatical arrangement of words in a sentence and their relationship with other words. The main focus of syntactic analysis is to find the syntactic structure of a sentence, which is usually represented as a tree. Identifying the syntactic structure is useful in determining the meaning of a sentence. Natural language processing processes data through lexical analysis, syntax analysis, semantic analysis, discourse processing, and pragmatic analysis. This paper surveys various parsing methods. The algorithm in this paper splits English sentences into parts using a POS (Parts Of Speech) tagger, identifies the type of sentence (simple, complex, interrogative, factual, active, passive, etc.), and then parses these sentences using the grammar rules of natural language. As natural language processing becomes increasingly relevant, there is a need for treebanks catered to the specific needs of more individualized systems. Here, we present an open-source technique to check and correct grammar. The methodology gives appropriate grammatical suggestions.
The document discusses text mining, including defining it as the extraction of information from unstructured text using computational methods. It covers topics such as structured vs unstructured data, common text mining practice areas like information retrieval and document clustering, and challenges in text mining including ambiguity in language. Pre-processing techniques for text mining are also outlined, such as normalization, tokenization, stemming and removing stop words to clean and prepare text for analysis.
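The pre-processing steps listed above chain together naturally; the sketch below runs normalization, tokenization, stop-word removal, and a deliberately crude suffix-stripping stemmer. The stop-word list and stemming rule are minimal placeholders.

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "are"}

def preprocess(text):
    text = text.lower()                                  # normalization
    tokens = re.findall(r"[a-z0-9]+", text)              # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stop-word removal
    # toy stemming: strip one common suffix from longer tokens
    stems = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t
             for t in tokens]
    return stems

print(preprocess("The parsers parsed the documents."))
# ['parser', 'pars', 'document']
```

Note the crude rule maps "parsed" to "pars" rather than "parse"; a real stemmer such as Porter's adds conditions and restoration rules precisely to handle such cases.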
Lectura 3.5: Word normalization in Twitter, finite-state transducers (Matias Menendez)
This document describes a system for normalizing Spanish tweets using finite-state transducers. The system consists of three main components: 1) An analyzer that tokenizes tweets and identifies standard words, 2) Transducers that generate candidate words for out-of-vocabulary tokens by implementing linguistic models of spelling variation, 3) A statistical language model that selects the most probable sequence of words. The transducers are based on weighted finite-state automata and implement models of texting features such as logograms, shortenings, and non-standard spellings. An evaluation shows the system is effective at normalizing Spanish tweets.
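The three-component pipeline described above can be sketched as follows: detect out-of-vocabulary tokens, generate normalization candidates, and let a language model pick the best one. Here a tiny rewrite table stands in for the weighted finite-state transducers, a unigram model stands in for the full statistical language model, and the data is invented English rather than Spanish.

```python
# Toy tweet-normalization pipeline: analyzer, candidate generation, selection.

VOCAB = {"you", "are", "great", "late", "see"}
REWRITES = {"u": ["you"], "r": ["are"], "gr8": ["great", "grate"], "l8": ["late"]}
UNIGRAM = {"you": 0.05, "are": 0.04, "great": 0.01, "grate": 0.0005, "late": 0.01}

def normalize(tokens):
    out = []
    for t in tokens:
        if t in VOCAB:
            out.append(t)  # in-vocabulary token: keep as is
        else:
            # candidate generation (the transducers' role), then selection
            cands = REWRITES.get(t, [t])
            out.append(max(cands, key=lambda c: UNIGRAM.get(c, 0)))
    return out

print(normalize(["u", "r", "gr8"]))  # ['you', 'are', 'great']
```

The real system's transducers generate candidates compositionally from spelling-variation models (logograms, shortenings, etc.) instead of a fixed table, and the language model scores whole sequences rather than single words.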
This document discusses several issues in part-of-speech (POS) tagging for Tamil language texts. It identifies challenges such as resolving lexical ambiguity, normalization of corpora, and analyzing compound words and multi-word expressions. It also discusses issues like determining the fineness versus coarseness of linguistic analysis, distinguishing syntactic function from lexical category, and whether to use new tags or tags from standard tagsets. Specific examples are provided to illustrate POS ambiguity depending on context, like the word "paṭi" which can be a noun, verb, adjective, adverb, or particle.
A stemming algorithm reduces inflected words to their word stem or root form. It works by removing suffixes and endings while trying to leave the stem in a familiar form. Developing a good stemming algorithm requires understanding the language's grammar, morphology, and common word forms. The algorithm is built incrementally and rules are evaluated based on whether they improve or degrade search performance across a test vocabulary. Irregular forms and stopwords also need to be handled.
Improving a Lightweight Stemmer for Gujarati Language (ijistjournal)
Stemming lies at the root of text mining. It is used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR), and Text Mining (TM), including Text Categorization (TC) and Text Summarization (TS). Building an effective stemmer for Gujarati has always been an active research area, since Gujarati has a structure very different from, and more difficult than, other languages due to its rich morphology.
In this paper, a novel hierarchical Persian stemming approach based on the Part-Of-Speech (POS) of the word in a sentence is presented. The implemented stemmer includes hash tables and several deterministic finite automata (DFA) at different levels of its hierarchy for removing the prefixes and suffixes of words. We had two intentions in using hash tables in our method: first, the DFA do not support some special words, so hash tables can partly solve that problem; second, they speed up the implemented stemmer by omitting the time the DFA need. Because of the hierarchical organization, this method is fast and flexible. Our experiments on test sets from the Hamshahri Collection and Security News from the ICTna.ir site show that our method has an average accuracy of 95.37%, which improves further when the method is used on a test set with common topics.
Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak... (CSCJournals)
This paper describes a lightweight, scalable model that predicts whether a Dutch verb ends in -d, -t or -dt. The confusion of these three endings is a common Dutch spelling mistake. If the predicted ending is different from the ending as written by the author, the system will signal the dt-mistake. This paper explores various data sources to use in this classification task, such as the Europarl Corpus, the Dutch Parallel Corpus and a Dutch Wikipedia corpus. Different architectures are tested for the model training, focused on a transfer learning approach with ULMFiT. The trained model can predict the right ending with 99.4% accuracy, and this result is comparable to the current state-of-the-art performance. Adjustments to the training data and the use of other part-of-speech taggers may further improve this performance. As discussed in this paper, the main advantages of the approach are the short training time and the potential to use the same technique with other disambiguation tasks in Dutch or in other languages.
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS (IJMIT JOURNAL)
In the context of Information Retrieval, Arabic stemming algorithms have become a major research area. Many researchers have developed algorithms to solve the problem of stemming, each proposing their own methodology and measurements to test the performance and compute the accuracy of their algorithm. Thus, nobody can make accurate comparisons between these algorithms. Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper. Then, the main Arabic language characteristics that need to be covered before discussing Arabic stemmers are summarized. The evaluation of the algorithms in this paper shows that Arabic stemming is still one of the major information retrieval challenges. This paper aims to compare the most commonly used light stemmers in terms of affix lists, algorithms, main ideas, and information retrieval performance. The results show that the light10 stemmer outperformed the other stemmers. Finally, recommendations for future research regarding the development of a standard Arabic stemmer are presented.
Automatic Transcription Of English Connected Speech Phenomena (ITIIIndustries)
Many phonetic phenomena that occur in connected speech are classified as the phonetic periphery, where anything can happen. A well-known and convenient way to record any phonetic phenomenon using certain symbols is transcription. The current paper aims to show a model for predicting allophones by coordinating a number of factors that determine the choice of a particular allophone, and to visualize the result by changing certain letters into corresponding IPA symbols. The Free Pascal compiler and the Geany editor are used for programming purposes. The model is created for American English. It is tested for tap and glottal burst, the latter being one of the three glottalization patterns. The difference between the combination of factors used for purely linguistic analysis and for computer programming is explained. We demonstrate (i) the framework for integrating separate blocks, each dealing with one phenomenon, and (ii) a block for tapping, which is almost finalized, and a part of a block on glottalization, particularly patterns for glottal burst.
Designing A Rule Based Stemming Algorithm for Kambaata Language Text (CSCJournals)
Stemming is the process of reducing inflectional and derivational variants of a word to its stem. It has substantial importance in several natural language processing applications. In this research, a rule based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is a single pass, context-sensitive, and longest-matching designed by adapting rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology and word formations mostly relying on suffixation; even though its word formation involves infixation, compounding and reduplication as well.
The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, error counting method was applied. A test set of 2425 distinct words was used to evaluate the stemmer. The output from the stemmer indicates that out of 2425 words, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were over stemmed and 13 words (0.54%) were under stemmed. What is more, a dictionary reduction of 65.86% has also been achieved during evaluation.
The main factor for errors in stemming Kambaata words is the language's rich and complex morphology. Hence a number of errors can be corrected by exploring more rules. However, it is difficult to avoid the errors completely due to complex morphology that makes use of concatenated suffixes, irregularities through infixation, compounding, blending, and reduplication of affixes.
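The evaluation percentages reported above follow directly from the error counts on the 2,425-word test set; a quick check of the arithmetic:

```python
# Verify the reported Kambaata stemmer evaluation figures.
total = 2425
correct, over, under = 2349, 63, 13
assert correct + over + under == total  # the three outcomes partition the set

print(round(100 * correct / total, 2))  # 96.87 (correctly stemmed)
print(round(100 * over / total, 2))     # 2.6  (over-stemmed)
print(round(100 * under / total, 2))    # 0.54 (under-stemmed)
```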
An expert system for automatic reading of a text written in Standard Arabic (ijnlc)
In this work we present our expert system for automatic reading, or speech synthesis, of text written in Standard Arabic. The work is carried out in two main stages: the creation of the sound database, and the transformation of written text into speech (Text-To-Speech, TTS). This transformation is done first by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with the aim of transforming it into its corresponding phonetic sequence, and second by the generation of the voice signal corresponding to the transcribed chain. We lay out the design of the system, as well as the results obtained, compared to other works on TTS for Standard Arabic.
Lecture 7 - Text Statistics and Document Parsing (Sean Golliher)
This document discusses various techniques for text processing and indexing documents for information retrieval systems. It covers topics like tokenization, stemming, stopwords, n-grams to identify phrases, and weighting important document elements like headers, anchor text, and metadata. The document also discusses using links between documents for link analysis and utilizing anchor text for retrieval.
This document presents an enhanced suffix stripping algorithm to improve information retrieval. It discusses stemming, which reduces words to their root form to increase retrieval efficiency. Porter's stemming algorithm is widely used but has drawbacks. The proposed enhanced algorithm adds new rules to address inaccuracies in Porter's algorithm and reduce over-stemming and under-stemming errors. An analysis shows the enhanced algorithm reduces the index size more than Porter's algorithm and has lower over-stemming and under-stemming indexes, improving retrieval effectiveness.
Smart grammar: a dynamic spoken language understanding grammar for inflective ... (ijnlc)
1. The document proposes SmartGrammar, a new method for developing spoken language understanding grammars for inflectional languages like Italian.
2. SmartGrammar uses a morphological analyzer to convert user utterances into their canonical forms before parsing, allowing the grammar to contain only canonical word forms rather than all possible inflections.
3. This significantly reduces the complexity and size of grammars for inflectional languages by representing many possible inflected forms with a single canonical form entry, making grammar development and management easier.
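The canonicalization idea in points 2-3 above can be sketched as follows: map each inflected surface form to its canonical form (lemma) before matching against the grammar, so the grammar needs only one entry per lemma. The lemma table below is a toy stand-in for a real morphological analyzer, with a few invented Italian verb forms.

```python
# Toy sketch: canonicalize inflected forms before grammar matching.

LEMMAS = {"vado": "andare", "vai": "andare", "andiamo": "andare",
          "voglio": "volere", "vuoi": "volere"}

# one grammar entry in canonical form covers every inflection pattern
GRAMMAR = {("volere", "andare"): "REQUEST_GO"}

def interpret(utterance):
    # map each token to its canonical form; unknown tokens pass through
    canonical = tuple(LEMMAS.get(w, w) for w in utterance.split())
    return GRAMMAR.get(canonical, "NO_PARSE")

print(interpret("voglio andare"))  # REQUEST_GO
print(interpret("vuoi andare"))    # REQUEST_GO (different inflection, same entry)
```

Without canonicalization, the grammar would need a separate entry for every inflected combination, which is what makes grammars for inflectional languages blow up in size.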
IRJET - Survey on Grammar Checking and Correction using Deep Learning for Indi... (IRJET Journal)
This document summarizes a survey on using deep learning to develop grammar checkers for Indian languages. It discusses different types of existing grammar checkers, such as rule-based, statistical, and hybrid approaches. Grammar checkers have been developed for several languages including English, Afan Oromo, Portuguese, Punjabi, Amharic, Swedish, Nepali, Icelandic, and Hindi. Most use rule-based approaches, where input sentences are checked against predefined grammar rules. Statistical approaches check input against annotated corpora. Hybrid approaches combine rule-based and statistical methods. The survey concludes that while many grammar checkers exist for other languages, only a limited number have been developed for Indian languages, leaving room for future research on developing deep learning-based grammar checkers for Indian languages.
Finite-state morphological parsing uses finite-state transducers to parse words into their morphological components like stems and affixes. It requires a lexicon of stems and affixes, morphotactic rules describing valid morpheme combinations, and orthographic rules for spelling changes. The parser is built as a cascade of finite-state automata representing the lexicon, morphotactics and spelling rules. It maps surface word forms onto their underlying lexical representations including stems and morphological features. This allows morphological analysis of both regular and irregular forms.
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
Part of speech (POS) tagging is the process of assigning a part of speech tag like noun, verb, adjective to each word in a sentence. It involves determining the most likely tag sequence given the probabilities of tags occurring before or after other tags, and words occurring with certain tags. POS tagging is the first step in many NLP applications and helps determine the grammatical role of words. It involves calculating bigram and lexical probabilities from annotated corpora to find the tag sequence with the highest joint probability.
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...Guy De Pauw
This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers and a morphological analyzer to effectively tokenize Setswana text. The approach was tested on a short Setswana text and achieved 93.6% accuracy between the automatically and hand-tokenized texts. While mostly successful, some issues remained around longest matches that were not valid tokens or did not allow morphological analysis. Overall, the approach demonstrated that a precise tokenizer and morphological analyzer can largely resolve the challenges of Setswana's disjunctive writing system.
Implementation Of Syntax Parser For English Language Using Grammar RulesIJERA Editor
From many years we have been using Chomsky‟s generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. Syntactic parsing mainly works with syntactic structure of a sentence. The 'syntax' refers to the grammatical and syntactical arrangement of words in a sentence and their relationship with other words. The main focus of syntactic analysis is important to find syntactic structure of a sentence which usually is represented as a tree structure. To identify the syntactic structure is useful in determining the meaning of a sentence Natural language processing processes the data through lexical analysis, Syntax analysis, Semantic analysis, and Discourse processing, Pragmatic analysis. This paper gives various parsing methods. The algorithm in this paper splits the English sentences into parts using POS (Parts Of Speech) tagger, It identifies the type of sentence (Simple, Complex, Interrogate, Facts, active, passive etc.) and then parses these sentences using grammar rules of Natural language. As natural language processing becomes an increasingly relevant, there is a need for tree banks catered to the specific needs of more individualized systems. Here, we present the open source technique to check and correct the grammar. The methodology will give appropriate grammatical suggestions.
The document discusses text mining, including defining it as the extraction of information from unstructured text using computational methods. It covers topics such as structured vs unstructured data, common text mining practice areas like information retrieval and document clustering, and challenges in text mining including ambiguity in language. Pre-processing techniques for text mining are also outlined, such as normalization, tokenization, stemming and removing stop words to clean and prepare text for analysis.
Lectura 3.5 word normalizationintwitter finitestate_transducersMatias Menendez
This document describes a system for normalizing Spanish tweets using finite-state transducers. The system consists of three main components: 1) An analyzer that tokenizes tweets and identifies standard words, 2) Transducers that generate candidate words for out-of-vocabulary tokens by implementing linguistic models of spelling variation, 3) A statistical language model that selects the most probable sequence of words. The transducers are based on weighted finite-state automata and implement models of texting features such as logograms, shortenings, and non-standard spellings. An evaluation shows the system is effective at normalizing Spanish tweets.
This document discusses several issues in part-of-speech (POS) tagging for Tamil language texts. It identifies challenges such as resolving lexical ambiguity, normalization of corpora, and analyzing compound words and multi-word expressions. It also discusses issues like determining the fineness versus coarseness of linguistic analysis, distinguishing syntactic function from lexical category, and whether to use new tags or tags from standard tagsets. Specific examples are provided to illustrate POS ambiguity depending on context, like the word "paṭi" which can be a noun, verb, adjective, adverb, or particle.
A stemming algorithm reduces inflected words to their word stem or root form. It works by removing suffixes and endings while trying to leave the stem in a familiar form. Developing a good stemming algorithm requires understanding the language's grammar, morphology, and common word forms. The algorithm is built incrementally and rules are evaluated based on whether they improve or degrade search performance across a test vocabulary. Irregular forms and stopwords also need to be handled.
Improving a Lightweight Stemmer for Gujarati Languageijistjournal
The origin of route of text mining is the process of stemming. It is usually used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM) including Text Categorization (TC), Text Summarization (TS). Establish a stemmer effective for the language of Gujarati has been always a search domain hot since the Gujarati has a very different structure and difficult that the other language due to the rich morphology.
In this paper, a novel hierarchical Persian Stemming approach based on the Part-Of-Speech (POS) of the
word in a sentence is presented. The implemented stemmer includes hash tables and several deterministic
finite automata (DFA) in its different levels of hierarchy for removing the prefixes and suffixes of the
words. We had two intentions in using hash tables in our method. The first one is that the DFA don’t
support some special words, so hash table can partly solve the addressed problem. And the second goal is
to speed up the implemented stemmer with omitting the time that DFA need. Because of the hierarchical
organization, this method is fast and flexible enough. Our experiments on test sets from Hamshahri
Collection and Security News from ICTna.ir Site show that our method has the average accuracy of
95.37% which is even improved in using the method on a test set with common topics.
Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak... (CSCJournals)
This paper describes a lightweight, scalable model that predicts whether a Dutch verb ends in -d, -t or -dt. The confusion of these three endings is a common Dutch spelling mistake. If the predicted ending is different from the ending as written by the author, the system will signal the dt-mistake. This paper explores various data sources to use in this classification task, such as the Europarl Corpus, the Dutch Parallel Corpus and a Dutch Wikipedia corpus. Different architectures are tested for the model training, focused on a transfer learning approach with ULMFiT. The trained model can predict the right ending with 99.4% accuracy, and this result is comparable to the current state-of-the-art performance. Adjustments to the training data and the use of other part-of-speech taggers may further improve this performance. As discussed in this paper, the main advantages of the approach are the short training time and the potential to use the same technique with other disambiguation tasks in Dutch or in other languages.
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS (IJMIT JOURNAL)
In the context of Information Retrieval, Arabic stemming algorithms have become a major research area in
information retrieval. Many researchers have developed algorithms to solve the stemming problem.
Each researcher proposed his own methodology and measurements to test the performance and compute
the accuracy of his algorithm, so accurate comparisons between these algorithms are difficult.
Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper.
Then, the main Arabic language characteristics that are necessary to be mentioned before discussing
Arabic stemmers are summarized. The evaluation of the algorithms in this paper shows that Arabic
stemming is still one of the major information retrieval challenges. This paper aims to compare
the most commonly used light stemmers in terms of affix lists, algorithms, main ideas, and
information retrieval performance. The results show that the light10 stemmer outperformed the other
stemmers. Finally, recommendations for future research regarding the development of a standard Arabic
stemmer were presented.
Automatic Transcription Of English Connected Speech Phenomena (ITIIIndustries)
Many phonetic phenomena that occur in connected speech are classified as phonetic periphery where anything can happen. A well-known convenient way to fix any phonetic phenomenon using certain symbols is transcription. The current paper aims at showing the model of predicting allophones by coordinating a number of factors that determine the choice of a particular allophone and visualizing the result changing certain letters into corresponding IPA symbols. Free Pascal compiler and Geany editor are used for programming purposes. The model is created for American English. It is tested for tap and glottal burst, the latter being one of the three glottalization patterns. The difference of the combination of factors for purely linguistic analysis and for computer programming is explained. We demonstrate (i) the framework for integrating separate blocks each dealing with one phenomenon (ii) a block for tapping which is almost finalized and a part of a block on glottalization, particularly patterns for glottal burst.
Designing A Rule Based Stemming Algorithm for Kambaata Language Text (CSCJournals)
Stemming is the process of reducing inflectional and derivational variants of a word to its stem. It has substantial importance in several natural language processing applications. In this research, a rule-based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is a single-pass, context-sensitive, longest-matching algorithm designed by adapting the rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology, with word formation mostly relying on suffixation, even though its word formation also involves infixation, compounding and reduplication.
The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, error counting method was applied. A test set of 2425 distinct words was used to evaluate the stemmer. The output from the stemmer indicates that out of 2425 words, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were over stemmed and 13 words (0.54%) were under stemmed. What is more, a dictionary reduction of 65.86% has also been achieved during evaluation.
The main factor for errors in stemming Kambaata words is the language's rich and complex morphology. Hence a number of errors can be corrected by exploring more rules. However, it is difficult to avoid the errors completely due to complex morphology that makes use of concatenated suffixes, irregularities through infixation, compounding, blending, and reduplication of affixes.
An expert system for automatic reading of a text written in Standard Arabic (ijnlc)
In this work we present our expert system for automatic reading, or speech synthesis, based on a text
written in Standard Arabic. Our work is carried out in two main stages: the creation of the sound
database, and the transformation of written text into speech (Text To Speech, TTS). This transformation is
done first by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with
the aim of transforming it into its corresponding phonetic sequence, and second by the generation of
the voice signal corresponding to the transcribed chain. We present the different design stages of
the system, as well as the results obtained compared to other works on TTS based on
Standard Arabic.
Lecture 7 - Text Statistics and Document Parsing (Sean Golliher)
This document discusses various techniques for text processing and indexing documents for information retrieval systems. It covers topics like tokenization, stemming, stopwords, n-grams to identify phrases, and weighting important document elements like headers, anchor text, and metadata. The document also discusses using links between documents for link analysis and utilizing anchor text for retrieval.
This document presents an enhanced suffix stripping algorithm to improve information retrieval. It discusses stemming, which reduces words to their root form to increase retrieval efficiency. Porter's stemming algorithm is widely used but has drawbacks. The proposed enhanced algorithm adds new rules to address inaccuracies in Porter's algorithm and reduce over-stemming and under-stemming errors. An analysis shows the enhanced algorithm reduces the index size more than Porter's algorithm and has lower over-stemming and under-stemming indexes, improving retrieval effectiveness.
Smart grammar a dynamic spoken language understanding grammar for inflective ... (ijnlc)
1. The document proposes SmartGrammar, a new method for developing spoken language understanding grammars for inflectional languages like Italian.
2. SmartGrammar uses a morphological analyzer to convert user utterances into their canonical forms before parsing, allowing the grammar to contain only canonical word forms rather than all possible inflections.
3. This significantly reduces the complexity and size of grammars for inflectional languages by representing many possible inflected forms with a single canonical form entry, making grammar development and management easier.
IRJET - Survey on Grammar Checking and Correction using Deep Learning for Indi... (IRJET Journal)
This document summarizes a survey on using deep learning to develop grammar checkers for Indian languages. It discusses different types of existing grammar checkers such as rule-based, statistical, and hybrid approaches. Grammar checkers have been developed for several languages including English, Afan Oromo, Portuguese, Punjabi, Amharic, Swedish, Nepali, Icelandic, and Hindi. Most use rule-based approaches where input sentences are checked against predefined grammar rules. Statistical approaches check input against annotated corpora. Hybrid approaches combine rule-based and statistical methods. The survey concludes that while many grammar checkers exist for other languages, only a limited number have been developed for Indian languages, leaving room for future research on developing deep
Designing a Rule Based Stemmer for Afaan Oromo Text (Waqas Tariq)
Most natural language processing systems use a stemmer as a separate module in their architecture. In particular, it is very significant for developing machine translators, speech recognizers and search engines. In this thesis work, a stemming system for Afaan Oromo is presented. This system takes a word as input and removes its affixes according to a rule-based algorithm. The result of the study is a prototype context-sensitive iterative stemmer for Afaan Oromo. The error counting technique was employed to evaluate the performance of this stemmer. The errors were analyzed and classified into two categories: understemming and overstemming errors. For testing purposes, a corpus collected from different public Afaan Oromo newspapers and bulletins was used. Newspapers, bulletins and public magazines cover different issues of the community: social, economic, technological and political. This reduces the probability of making the corpus biased toward specific words that do not appear in everyday life, and makes the testing set balanced. According to the evaluation of the experiments, the overall accuracy of the stemmer is encouraging, which shows that stemming can be performed with low error rates in highly inflected languages such as Afaan Oromo.
By speech synthesis we can, in theory, mean any kind of synthesis of speech. For example, it can be the process in which a speech decoder generates the speech signal based on the parameters it has received through the transmission line, or it can be a procedure performed by a computer to estimate some kind of presentation of the speech signal given a text input. Since there is a special course about codecs (Puheen koodaus, Speech Coding), this chapter will concentrate on text-to-speech synthesis, or shortly TTS, which will often be referred to as speech synthesis to simplify the notation. Anyway, it is good to keep in mind that irrespective of what kind of synthesis we are dealing with, there are similar criteria in regard to speech quality. We will return to this topic after a brief TTS motivation, and the rest of this chapter will be dedicated to the implementation point of view in TTS systems. Text-to-speech synthesis is a research field that has received a lot of attention and resources during the last couple of decades, for excellent reasons. One of the most interesting ideas (rather futuristic, though) is the fact that a workable TTS system, combined with a workable speech recognition device, would actually be an extremely efficient method for speech coding. It would provide an incomparable compression ratio and flexible possibilities to choose the type of speech (e.g., breathless or hoarse), the fundamental frequency along with its range, the rhythm of speech, and several other effects. Furthermore, if the content of a message needs to be changed, it is much easier to retype the text than to record the signal again. Unfortunately, this kind of system does not yet exist for large vocabularies. Of course there are also numerous speech synthesis applications that are closer to being available than the one discussed above.
For instance, a telephone inquiry system where the information is frequently updated, can use TTS to deliver answers to the customers. Speech synthesizers are also important to the visually impaired and to those who have lost their ability to speak. Several other examples can be found in everyday life, such as listening to the messages and news instead of reading them, and using hands-free functions through a voice interface in a car, and so on.
This document is a preprint that summarizes a paper on developing a spell checker model using automata. It discusses using fuzzy automata rather than finite automata for improved string comparison. The proposed spell checker would incorporate autosuggestion features into a Windows application. It would use techniques like edit distance and tries to store dictionaries and suggest corrections. The paper outlines the design of the spell checker, discussing functions like comparing words to a dictionary and considering morphology. Advantages like improved accuracy and speed are discussed along with potential disadvantages like inability to detect all errors.
This paper presents a model for Chinese word segmentation that integrates it as part of sentence analysis using a parser. The model achieves high accuracy by resolving most ambiguities at the lexical level using dictionary information, but handles cases requiring syntactic context in the parsing process. The complexity usually associated with parsing is reduced by pruning implausible segmentations prior to parsing. The approach is implemented in a natural language understanding system developed at Microsoft Research.
This paper introduces a novel approach to tackle the existing gap on message translations in dialogue systems. Currently, submitted messages to the dialogue systems are considered as isolated sentences. Thus, missing context information impede the disambiguation of homographs words in ambiguous sentences. Our approach solves this disambiguation problem by using concepts over existing ontologies.
English to Punjabi machine translation system using hybrid approach of word s... (IAEME Publication)
This document describes a hybrid machine translation and word sense disambiguation system for translating English sentences to Punjabi. The system uses conditional random fields to disambiguate words with multiple meanings by determining the category with the highest word frequency in the context. The system achieves 81.2% accuracy on test sentences by first analyzing sentences, then synthesizing the translation while addressing word ambiguities, and finally outputting the translated sentence.
A NEW STEMMER TO IMPROVE INFORMATION RETRIEVAL
International Journal of Network Security & Its Applications (IJNSA), Vol.5, No.4, July 2013
DOI: 10.5121/ijnsa.2013.5411
A NEW STEMMER TO IMPROVE INFORMATION
RETRIEVAL
Wahiba Ben Abdessalem Karaa
University of Tunis. Higher Institute of Management, Tunisia
RIADI-GDL laboratory, ENSI, National School of Computer Sciences. Tunisia
wahiba.abdessalem@isg.rnu.tn
ABSTRACT
Stemming is a technique used to reduce words to their root form by removing derivational and
inflectional affixes. Stemming is widely used in information retrieval tasks. Many researchers have
demonstrated that stemming improves the performance of information retrieval systems. The Porter stemmer is
the most common algorithm for English stemming. However, this stemming algorithm has several
drawbacks, since its simple rules cannot fully describe English morphology. Errors made by this stemmer
may affect the information retrieval performance.
The present paper proposes an improved version of the original Porter stemming algorithm for the English
language. The proposed stemmer is evaluated using the error counting method. With this method, the
performance of a stemmer is computed by calculating the number of understemming and overstemming
errors. The obtained results show an improvement in stemming accuracy, compared with the original
stemmer, but also compared to other stemmers such as the Paice and Lovins stemmers. We show, in addition,
that the new version of the Porter stemmer improves information retrieval performance.
Keywords
Stemming, Porter stemmer, information retrieval
1. INTRODUCTION
Stemming is a technique to detect different inflections and derivations of morphological variants
of words in order to reduce them to one particular root called stem. A word's stem is its most
elementary form which may or may not have a semantic interpretation. In documents written in
natural language, it is hard to retrieve relevant information: since languages are characterized
by various morphological variants of words, vocabulary mismatches arise. In applications
using stemming, documents are represented by stems rather than by the original words. Thus, the
index of a document containing the words "computing", "compute" and "computer" will map all
these words to one common root which is "compute". This means that stemming algorithms can
considerably reduce the document index size, especially for highly inflected languages, which
leads to important efficiency in time processing and memory requirements.
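The index-size effect described above can be illustrated with a small sketch. The `stem` function below uses a handful of invented suffix rules (it is not the Porter algorithm, and it conflates the “computing/compute/computer” variants to the key “comput” rather than the root “compute” quoted in the text):

```python
# Sketch: indexing documents by stem instead of surface form.
# The suffix rules here are a toy illustration, not a real stemmer.

def stem(word):
    """Toy stemmer: strip one of a few common suffixes, longest first."""
    for suffix in ("ing", "ers", "er", "ed", "es", "e", "s"):
        # require a reasonably long remaining stem before stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)]
    return word

def build_index(documents):
    """Inverted index mapping each stem to the set of documents containing a variant."""
    index = {}
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index.setdefault(stem(word), set()).add(doc_id)
    return index

docs = ["computing on a computer", "they compute daily"]
index = build_index(docs)
```

With stemming, the three surface forms share one index entry, so a query for any variant reaches both documents; without it, the index would hold three separate keys.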
The first research on stemming was done for English, a language with a relatively simple
morphology. A diversity of stemming algorithms has been proposed for the English language,
such as the Lovins stemmer [1] and the Paice/Husk stemmer [2].
2. PORTER STEMMER
Porter stemmer was developed by Martin Porter in 1980 at the University of Cambridge [4].
Porter’s algorithm is applied in many fields as a pre-processing step for the indexing task; its
main use is as part of a term normalization process that is usually done when setting up an
Information retrieval system. The Porter stemmer is actually the most commonly used of all
stemmers. It has shown improvements in retrieval performance and in other fields such as
classification, clustering, and spam filtering, and it has become the most popular and standard approach
to stemming.
The stemmer is based on the idea that the suffixes in the English language are mostly built of a
combination of smaller and simpler suffixes; for instance, the suffix "fullness" is composed of
two suffixes “full” and “ness”. Thus, the Porter stemmer is a linear step stemmer: it applies
morphological rules sequentially, removing affixes in stages.
Specifically, the algorithm has five steps, each defining a set of rules. To stem a word, the
rules are tested sequentially; if one of these rules matches the current word, then the conditions
attached to that rule are tested. Once a rule is accepted, the suffix is removed and control moves
to the next step. If the rule is not accepted, then the next rule in the same step is tested, until either
a rule from that step is accepted or there are no more rules in that step, in which case control
passes to the next step. This process continues through all five steps; after the last step, the resulting
stem is returned by the stemmer. The whole algorithm is summarized in the following activity
diagram (Figure 1):
Figure 1. The steps of the Porter Stemming Algorithm. [Activity diagram: Word → Step 1 (recode plurals and simple present; recode past and present participles; recode “y” to “i”) → Step 2 (recode double suffixes to simple suffixes) → Step 3 (recode remaining double suffixes) → Step 4 (recode remaining simple suffixes) → Step 5 (recode stems with -e; recode stems with -l) → Stem]
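The sequential rule testing described above can be sketched as follows. This is a simplified illustration with a handful of invented rules; the real Porter algorithm attaches extra conditions (the “measure” of the stem) to each rule, which are omitted here:

```python
# Sketch of Porter-style stepwise stemming: each step is an ordered list of
# (suffix, replacement) rules; within a step the first matching rule fires,
# and control then moves to the next step. Rules are illustrative only.
STEPS = [
    [("ations", "ation"), ("ies", "i"), ("s", "")],   # step 1: plurals etc.
    [("ization", "ize"), ("ator", "ate")],            # step 2: double -> simple suffix
    [("ize", ""), ("ful", "")],                       # step 3: remaining double suffixes
    [("ation", ""), ("al", "")],                      # step 4: remaining simple suffixes
    [("e", "")],                                      # step 5: final -e
]

def step_stem(word):
    for rules in STEPS:
        for suffix, repl in rules:
            if word.endswith(suffix):
                word = word[: len(word) - len(suffix)] + repl
                break  # first accepted rule ends this step
    return word
```

Tracing “generalizations” through the steps reproduces the chain quoted in the text: “generalization”, then “generalize”, then “general”, then “gener”.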
Words’ suffixes are removed step by step. The first step of the algorithm handles plurals, past
participles, and present participles, and transforms a terminal “y” to an “i”. For example:
“generalizations” is converted to “generalization”, “agreed” to “agree”, and “happy” to
“happi”. The second step reduces double suffixes to single ones. For example:
“generalization” is converted to “generalize”, “oscillator” to “oscillate”. The third step
removes other double suffixes not handled in the previous step such as: “generalize” is changed
into “general”.
The fourth step removes remaining suffixes such as: “general” is transformed into “gener”,
“oscillate” to “oscill”.
The fifth step treats stems ending with -e, and words ending in a double consonant. For
example: “attribute” is recoded to “attribut”, and “oscill” is converted to “oscil”.
3. PORTER STEMMER ERRORS
Although Porter stemmer is known to be powerful, it still faces many problems. The major ones
are Overstemming and Understemming errors. The first concept denotes the case where a word is
cut too much which may lead to the point where completely different words are conjoined under
the same stem. For instance, the words "general" and "generous" are stemmed under the same
stem "gener". The latter describes the opposite case, where a word is not cut enough. This can
lead to a situation where words derived from the same root do not have the same stem. This is due
essentially to the fact that Porter stemmer ignores many cases and disregards many exceptions.
For example, the Porter stemmer does not treat irregular verbs: “bought” remains “bought”, while
“buy” is stemmed to “bui”, whereas normally the two words should have the same stem “buy”. Irregular
plural nouns are not handled by the stemmer: words ending with –men are the plural of words
ending with –man. Porter stemmer makes other errors concerning the terminal –e. For instance,
“do” is stemmed “do” and “does” is stemmed “doe”. Many exceptions are not controlled: verb
conjugation, possessive nouns, irregular comparative and superlative forms (e.g. good, better,
best), etc. Moreover, more than 5000 suffixes are not handled by Porter such as –ativist, -ativistic,
-ativism, atavistically, –ship, –ist, –atory, –ingly, got, gotton, ound, ank, unk, ook, ept, ew, own,
etc. This would decrease the stemming quality, since related words are stemmed to different
forms. In an information retrieval context, such cases reduce the performance since some useful
documents will not be retrieved: a search for "ability", for example, will not return documents
containing the word "able". This would decrease the efficiency of diverse systems applying
Porter stemmer.
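The understemming and overstemming notions above can be made concrete with a small error-counting sketch. Here words are grouped by their true root; a pair of related words with different stems counts as understemming, and a pair of unrelated words with the same stem counts as overstemming. The groups and stems below are illustrative, following the “gener” overstemming example in the text:

```python
# Sketch of the error-counting idea for evaluating a stemmer.
from itertools import combinations

def count_errors(groups, stems):
    """groups: lists of words sharing a true root; stems: word -> produced stem."""
    under = over = 0
    words = [w for g in groups for w in g]
    group_of = {w: i for i, g in enumerate(groups) for w in g}
    for a, b in combinations(words, 2):
        same_group = group_of[a] == group_of[b]
        same_stem = stems[a] == stems[b]
        if same_group and not same_stem:
            under += 1   # related words split apart
        if not same_group and same_stem:
            over += 1    # unrelated words conflated
    return under, over

groups = [["general", "generally"], ["generous", "generously"]]
stems = {"general": "gener", "generally": "gener",
         "generous": "gener", "generously": "gener"}
under, over = count_errors(groups, stems)
```

With all four words conflated to “gener”, every cross-group pair is an overstemming error, which is exactly the failure mode the paper attributes to the original stemmer.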
4. CONTRIBUTION: THE NEW PORTER STEMMER
In order to improve the Porter stemmer, we studied English morphology and used its
characteristics to build the enhanced stemmer. The resulting stemmer, to which we will refer
as “New Porter”, includes five steps (Figure 2):
Figure 2. The steps of the new Porter Stemming Algorithm. [Activity diagram: Word → Step 1 (recode plurals and simple present; recode past and present participles; recode “y” to “i”) → Step 2 (recode double suffixes to simple suffixes) → Step 3 (recode remaining simple suffixes) → Step 4 (recode stems with -e; recode stems with -l) → Step 5 (recode irregular forms, using an irregular-forms dictionary) → Stem]
The first step, as in the Porter stemmer, handles inflectional morphology (plurals, verb
conjugation, etc.). The second step treats derivational morphology: it maps complex suffixes
(suffixes composed of more than one suffix) to the single suffix from which they were derived
(e.g. transforming the suffix –istic to –ist). The third step deletes simple (uncompounded)
suffixes. The fourth step defines a set of recoding rules to normalize stems. The last step treats
irregular forms that do not follow any pattern.
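The five-step flow can be sketched in Python. This is an illustrative skeleton only, not the authors' full rule set: each step carries a single representative rule, and `IRREGULAR_FORMS` is a tiny, assumed excerpt of the Step 5 dictionary.

```python
# Sketch of the five-step New Porter flow; each step holds one
# representative rule, where the real stemmer defines many per step.

IRREGULAR_FORMS = {"bought": "buy", "men": "man"}  # Step 5 dictionary (excerpt)

def step1_inflection(word):
    # inflectional morphology; e.g. -ies is recoded to -y
    return word[:-3] + "y" if word.endswith("ies") else word

def step2_double_suffix(word):
    # compound suffix mapped to the suffix it derives from, e.g. -istic -> -ist
    return word[:-5] + "ist" if word.endswith("istic") else word

def step3_simple_suffix(word):
    # delete remaining simple (uncompounded) suffixes, e.g. -ist
    return word[:-3] if word.endswith("ist") else word

def step4_recode(word):
    # recoding rules that normalize stems, e.g. strip a terminal -e
    return word[:-1] if word.endswith("e") and len(word) > 3 else word

def step5_irregular(word):
    # irregular forms that follow no pattern are looked up in a dictionary
    return IRREGULAR_FORMS.get(word, word)

def new_porter(word):
    for step in (step1_inflection, step2_double_suffix,
                 step3_simple_suffix, step4_recode, step5_irregular):
        word = step(word)
    return word
```

With these sample rules, "bought" maps to "buy" via the Step 5 dictionary, while "artistic" passes through Steps 2 and 3 to give "art".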
In the following, we group into different classes all the exceptional cases that are not handled by
the original stemmer. For each case, we suggest rules and indicate how many words are concerned
by these rules.
4.1 Class 1
In this class we gather all exceptions related to verb conjugation and pluralisation.
The Porter stemmer ignores irregular forms. These forms can be categorized into two types. The
first type comprises forms that do not follow any pattern, for instance "bought", the past
participle of "buy". To handle these cases, we inserted into Step 5 a dictionary containing a list of
common irregular forms. The second type concerns forms that follow a general pattern; for
instance, words ending in –feet are the plural forms of words ending in –foot. We propose rules to
handle this category. Table 1 presents the proposed rules and the results of applying the two
approaches. We also give the number of unhandled words for each category:
Table 1. Handling words belonging to class 1.

Category (plural pattern)         | Original words       | Results using Porter | Rule                | Results using New Porter | Number of words
-feet is the plural of -foot      | clubfoot/clubfeet    | clubfoot/clubfeet    | -feet -> -foot      | clubfoot/clubfoot        | 9 words
-men is the plural of -man        | drayman/draymen      | drayman/draymen      | -men -> -man        | drayman/drayman          | 427 words
-ci is the plural of -cus         | abacus/abaci         | abacus/abaci         | -ci -> -cus         | abacu/abacu              | 35 words
-eaux is the plural of -eau       | plateau/plateaux     | plateau/plateaux     | -eaux -> -eau       | plateau/plateau          | 29 words
-children is the plural of -child | child/children       | child/children       | -children -> -child | child/child              | 6 words
-wives is the plural of -wife     | farmwife/farmwives   | farmwif/farmwiv      | -wives -> -wife     | farmwif/farmwif          | 12 words
-knives is the plural of -knife   | knife/knives         | knif/kniv            | -knives -> -knife   | knif/knif                | 5 words
-staves is the plural of -staff   | flagstaff/flagstaves | flagstaff/flagstav   | -staves -> -staff   | flagstaff/flagstaff      | 7 words
-wolves is the plural of -wolf    | werewolf/werewolves  | werewolf/werewolv    | -wolves -> -wolf    | werewolf/werewolf        | 4 words
-trices is the plural of -trix    | aviatrix/aviatrices  | aviatrix/aviatric    | -trices -> -trix    | aviatrix/aviatrix        | 18 words
-mata is the plural of -ma        | chiasma/chiasmata    | chiasma/chiasmata    | -mata -> -ma        | chiasma/chiasma          | 108 words
-ei is the plural of -eus         | clypeus/clypei       | clypeu/clypei        | -ei -> -eus         | clypeu/clypeu            | 26 words
-pi is the plural of -pus         | carpus/carpi         | carpu/carpi          | -pi -> -pus         | carpu/carpu              | 17 words
-ses is the plural of -sis        | analysis/analyses    | analysi/analys       | -sis -> -s          | analys/analys            | 492 words
-xes is the plural of -xis        | praxis/praxes        | praxi/prax           | -xis -> -x          | prax/prax                | 32 words
Table 1 shows many cases that are not taken into account by Porter. The proposed rules handle
about 1200 exceptional cases that were ignored by the original algorithm.
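The pattern-based Class 1 rules can be expressed as ordered suffix rewrites. The sketch below is illustrative, not the authors' implementation: it applies longer suffixes first and covers only the categories listed in Table 1; the resulting singular surface form then goes through the regular stemming steps (so "abacus" would later become "abacu").

```python
# Excerpt of the Class 1 suffix-rewrite rules (irregular plurals that
# follow a general pattern), ordered longest-suffix-first so that e.g.
# "knives" is matched before the shorter "wives" pattern.
CLASS1_RULES = [
    ("children", "child"),
    ("knives", "knife"),
    ("staves", "staff"),
    ("wolves", "wolf"),
    ("trices", "trix"),
    ("wives", "wife"),
    ("feet", "foot"),
    ("eaux", "eau"),
    ("mata", "ma"),
    ("men", "man"),   # note: over-applies to words like "specimen"
    ("ci", "cus"),
    ("ei", "eus"),
    ("pi", "pus"),
]

def rewrite_plural(word):
    # rewrite a patterned irregular plural to its singular surface form
    for suffix, replacement in CLASS1_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word
```

For example, `rewrite_plural("draymen")` gives "drayman" and `rewrite_plural("farmwives")` gives "farmwife".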
4.2 Class 2
The Porter stemmer does not conflate verbs ending in –s (but not –ss) with their participle forms.
The stem of the infinitive form is obtained by removing the terminal –s. For the past or present
participle, the Porter stemmer removes the suffix –ed or –ing respectively and keeps the terminal
–s. Table 2 presents the proposed rules to handle this category and the results of applying the two
approaches.
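The Class 2 recoding described above can be sketched as a pair of guarded suffix rules. This is an illustrative sketch, not the authors' exact implementation; the stemmer's regular terminal -s removal then yields stems such as "focu".

```python
def class2_recode(word):
    # proposed Class 2 rules: strip -ed / -ing after a single -s so that
    # "focused" and "focusing" rejoin "focus"; -ss- stems are left alone
    if word.endswith("sed") and not word.endswith("ssed"):
        return word[:-2]          # "focused" -> "focus"
    if word.endswith("sing") and not word.endswith("ssing"):
        return word[:-3]          # "focusing" -> "focus"
    return word
```

Words such as "missed" or "passing", which end in a double -s, are deliberately left unchanged by these rules.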
Table 2. Handling words belonging to class 2.

Original words                     | Results using Porter       | Rules                                                    | Results using New Porter | Number of words
focus/focuses/focused/focusing     | focu/focus/focus/focus     | -sed (but not -ssed) -> -s; -sing (but not -ssing) -> -s | focu/focu/focu/focu      | 28 verbs
chorus/choruses/chorused/chorusing | choru/chorus/chorus/chorus | (same rules)                                             | choru/choru/chor/choru   |

4.3 Class 3
The Porter stemmer does not conflate words that end in –y and do not contain a vowel with their
derived forms. In fact, Porter defines a recoding rule that transforms a terminal –y to –i only if the
word contains a vowel. Another rule, ies -> i, is defined to handle words ending in –ies. This
conflates carry-carries, marry-marries, etc. However, words such as try-tries-tried are not
conflated, since the stem does not contain a vowel. Similarly, verbs ending in –ye are not handled
by the Porter stemmer. In fact, when a verb ends in –ye, the terminal –e is removed in the last step,
and hence the –y is not replaced by –i. To stem the past or present participle of such a verb, the
suffix –ed or –ing is removed in the first step, leaving the stem with a terminal –y, which is
replaced by –i in the next step; hence, the infinitive form of the verb and its past and present
participles are not conflated. To handle these cases, we propose to eliminate the rule defined in
the first step that transforms a terminal –y to –i when the word contains a vowel. Table 3 presents
the results of applying the two approaches to a set of words belonging to this category.
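A minimal sketch of the Class 3 fix, under the assumption that -ies and -ied are recoded to -y regardless of the vowel condition (this is our interpretation for illustration, not the authors' exact rule set):

```python
def class3_recode(word):
    # recode -ies / -ied to -y with no vowel condition, so that
    # try / tries / tried conflate even though the stem "tr" has no vowel
    if word.endswith("ies") or word.endswith("ied"):
        return word[:-3] + "y"
    if word.endswith("ying"):
        return word[:-3]          # "crying" -> "cry"
    return word
```

This conflates try-tries-tried to "try" and cry-cries-cried-crying to "cry", matching the New Porter column of Table 3.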
Table 3. Handling words belonging to class 3.
Original words         | Results using Porter | Results using New Porter | Number of words
cry/cries/cried/crying | cry/cri/cri/cry      | cry/cry/cry/cry          | about 20 verbs
dye/dyes/dyed/dying    | dye/dyes/dyed/dying  | dy/dy/dy/dy              |

4.4 Class 4
The Porter stemmer makes errors concerning verbs ending in a double consonant and their
derivations. If the stem of the present or past participle form ends in a double consonant, and that
consonant is other than 'l', 's' or 'z', the stemmer removes a letter and keeps the stem with a
single consonant. In the last step, the stemmer removes a consonant only for words ending in –ll
whose m value is greater than 1. This causes some problems. For instance, "ebbed" is stemmed to
"eb", whereas "ebb" is stemmed to "ebb"; hence, the two words are not conflated. The
exceptional cases are all verbs ending in a double consonant other than –l, –s and –z. Verbs ending
in –z that double the –z to form the present or past participle are also not treated by the Porter
stemmer: these verbs form their past or present participle by doubling the terminal –z, so they are
not conflated with their infinitive form. For instance, "whizzed", the past participle of "whiz", is
stemmed to "whizz", while "whiz" is kept unchanged; hence "whiz" and "whizzed" are not
conflated. To resolve these problems, we propose to redefine the recoding rule of the last step.
Initially, the rule deletes a double consonant only if the consonant in question is –l and the m
value of the stem is greater than 1. The rule is modified so that it removes a consonant from all
stems ending in a double consonant; for stems ending in –ll, it removes a consonant only if the
stem has an m value greater than 1. Table 4 presents the results of applying the two approaches to
a set of words belonging to this category.

Table 4. Handling words belonging to class 4.

Original words         | Results using Porter | Results using New Porter | Number of words
ebb/ebbed/ebbing       | ebb/eb/eb            | eb/eb/eb                 | 160 verbs
add/added/adding       | add/ad/ad            | ad/ad/ad                 |
staff/staffed/staffing | staff/staf/staf      | staf/staf/staf           |
spaz/spazzes/spazzed   | spaz/spazz/spazz     | spaz/spaz/spaz           |
whiz/whizzes/whizzed   | whiz/whizz/whizz     | whiz/whiz/whiz           |
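The modified Class 4 recoding rule can be sketched as follows. This is a simplified illustration: the additional measure condition (m > 1) that applies to stems ending in -ll is omitted.

```python
def undouble(stem):
    # modified final recoding rule (sketch): any stem ending in a doubled
    # consonant loses one letter ("ebb" -> "eb", "whizz" -> "whiz");
    # the original Porter rule did this only for -ll when the measure
    # m > 1 (that m check is omitted here for brevity)
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        return stem[:-1]
    return stem
```

Applied after suffix removal, this conflates "ebb" with "ebbed"/"ebbing" (all become "eb") and "whiz" with "whizzed" (both become "whiz").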
4.5 Class 5
The Porter stemmer does not treat present or past participle derivations. For instance, 'studiedly' is
stemmed to 'studiedli', whereas 'study' is stemmed to 'studi'; hence, the two forms are not
conflated. Table 5 presents the proposed rules to handle this category and the results of applying
the two approaches on a set of words belonging to this category:
Table 5. Handling words belonging to class 5.

Category                                                               | Original words                      | Results using Porter          | Rules            | Results using New Porter | Number of words
Words ending in –iedly or –iedness are related to words ending in –ied | study/studied/studiedness/studiedly | studi/studi/studied/studiedli | -ly-ied; -ss-ied | study/study/study/study  | 13 words
Words ending in –edly or –edness are related to words ending in –ed    | amaze/amazed/amazedly/amazedness    | amaz/amaz/amazedli/amazed     | -ly-ed; -ss-ed   | amaz/amaz/amaz/amaz      | 439 words
Words ending in –ingly or –ingness are related to words ending in –ing | amaze/amazing/amazingly/amazingness | amaz/amaz/amazingli/amazing   | -ly-ing; -ss-ing | amaz/amaz/amaz/amaz      | 543 words

4.6 Class 6
The Porter stemmer ignores many suffixes such as –est, –ist, –tary, –tor, –sor, –sory, –nor, –ship,
–acy, –ee, etc. We propose new rules to handle these suffixes. Many compound suffixes are also
ignored by the Porter stemmer. To deal with this problem, we propose to generate all possible
compound suffixes derived from each suffix; for instance, the suffixes derived from –ate include
–ative, –ativist, –ativistic, –ativism, etc. These suffixes are then mapped to a common suffix (the
suffix from which they were all derived). The proposed rules handle an important number of
exceptions (more than 5000). In the following we present an extract of the proposed rules to
handle these suffixes:

Table 6. Handling words belonging to class 6 (an extract).

Suffix     | Proposed rule      | Number of words ending in the suffix
-atization | -atization -> -ate | 1
-atist     | -atist -> -ate     | 20
-atism     | -atism -> -ate     | 29
-atic      | -atic -> -ate      | 229
-atical    | -atical -> -ate    | 30
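The compound-suffix mapping of Table 6 can be sketched as a lookup table applied longest-suffix-first. This is an excerpt for illustration only; the surviving base suffix (-ate) is then removed by the simple-suffix step.

```python
# Excerpt of the Class 6 compound-suffix map: each derived suffix is
# mapped back to the base suffix it was built from (here, -ate).
SUFFIX_MAP = {
    "atization": "ate",
    "atical": "ate",
    "atist": "ate",
    "atism": "ate",
    "atic": "ate",
}

def map_compound_suffix(word):
    # try longer suffixes first so that "-atical" wins over "-atic"
    for suffix in sorted(SUFFIX_MAP, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: -len(suffix)] + SUFFIX_MAP[suffix]
    return word
```

For example, "dogmatic" and "dogmatical" are both mapped to the same intermediate form ending in -ate, which later steps reduce further.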
The new Porter stemmer has been implemented. In what follows, we evaluate this stemmer
using a standard method inspired by Paice [5].
5. EXPERIMENTAL EVALUATION
5.1 Paice’s evaluation method
Stemmers were developed to improve information retrieval performance by transforming
morphologically related terms into a single stem. This implies that an effective stemmer should
conflate only pairs of words that are semantically equivalent. The problem is how a program can
judge when two words are semantically equivalent. Paice [5] proposed providing input to the
program in the form of grouped files. These files contain lists of words, alphabetically sorted, in
which any terms considered by the evaluator to be semantically equivalent are formed into
concept groups. An ideal stemmer should stem words belonging to the same group to a common
stem. If a stemmed group includes more than one unique stem, the stemmer has made
understemming errors. If a stem of a certain group also occurs in other stemmed groups, the
stemmer has made overstemming errors. This allows the computation of the Understemming and
Overstemming Indexes (UI and OI) and their ratio, the stemming weight (SW), for each stemmer.
The Understemming and Overstemming Indexes measure specific errors that occur when a
stemming algorithm is applied. According to these metrics, a good stemmer should produce as
few understemming and overstemming errors as possible. However, they cannot be considered
individually when analysing results. To determine the general relative accuracy of stemmers,
Paice defines a measure called the Error Rate Relative to Truncation (ERRT). It is useful for
deciding on the best overall stemmer in cases where one stemmer is better in terms of
understemming but worse in terms of overstemming. To calculate the ERRT, a baseline is used,
obtained by performing length truncation, i.e. reducing every word to a given fixed length. Paice
considers length truncation the crudest method of stemming and expects any other stemmer to do
better. Accordingly, Paice proposed determining the values of UI and OI for a series of truncation
lengths. This defines a truncation line against which any stemmer can be assessed. Any
reasonable stemmer will give an (OI, UI) point P between the truncation line and the origin; the
further the point is from the truncation line, the better the stemmer. The ERRT is obtained by
extending a line from the origin O through the point P until it intersects the truncation line at T.
The ERRT is then defined as:

ERRT = length(OP) / length(OT)
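The ERRT geometry can be sketched in a few lines. This simplification assumes the truncation line is approximated by a straight line through two truncation points A and B; in Paice's method it is a polyline through the (OI, UI) points of several truncation lengths.

```python
def errt(p, a, b):
    """ERRT sketch. p is a stemmer's (OI, UI) point; a and b are two
    points approximating the truncation line. T is the intersection of
    the ray from the origin O through p with the line AB, and
    ERRT = length(OP) / length(OT)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    # T = s * P, where s solves  s * P = A + u * (B - A)
    s = (dx * ay - ax * dy) / (dx * py - px * dy)
    return 1.0 / s
```

A point halfway between the origin and the truncation line gives ERRT = 0.5, and a point lying on the line gives ERRT = 1.0, matching the intuition that lower ERRT is better.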
To apply the Paice evaluation method, lists of grouped word files are required. We used two word
lists downloaded from the official website of Paice and Hooper [6]. The first list (Word List A),
initially used by Paice [5], contains about 10,000 words; the sample was taken from document
abstracts in the CISI test collection, which is concerned with Library and Information Science.
The second list (Word List B) is a larger grouped word set (about 20,000 words) compiled from
word lists used in Scrabble word checkers.
In this work, we first ran tests with Porter's original stemmer and the improved version in order
to evaluate our approach. We also ran tests with the Paice/Husk and Lovins stemmers. The results
of these tests are presented in Table 7.
Table 7. Stemming results using the two versions of the stemmers ((A) = Word List A, (B) = Word List B).

Stemmer    | UI (A)  | OI (A)    | SW (A)    | ERRT (A) | UI (B)  | OI (B)    | SW (B)    | ERRT (B)
New Porter | 0.1882  | 0.0000518 | 0.0002753 | 0.59     | 0.1274  | 0.0000441 | 0.0003464 | 0.44
Porter     | 0.3667  | 0.0000262 | 0.0000715 | 0.75     | 0.3029  | 0.0000193 | 0.0000638 | 0.63
Paice/Husk | 0.1285  | 0.0001159 | 0.0009020 | 0.57     | 0.1407  | 0.0001385 | 0.0009838 | 0.73
Lovins     | 0.3251  | 0.0000603 | 0.0001856 | 0.91     | 0.2806  | 0.0000736 | 0.0002623 | 0.86

5.2 Discussion
Comparing the Porter stemmer to New Porter, the relative values of the indexes can be
summarized as follows:
UI (Porter) > UI (New Porter)
OI (New Porter) > OI (Porter)
ERRT (Porter) > ERRT (New Porter)
The understemming value for Porter indicates that it leaves many more words understemmed
than New Porter. For instance, words such as "ability" and "able" are not conflated by the Porter
stemmer, whereas they are by New Porter. New Porter thus reduces the Understemming Index
and conflates many more related words than Porter, especially since the two word lists contain
many words ending in suffixes that the original stemmer does not treat. Hence, the UI value is
improved by the proposed approach. On the other hand, the OI value for New Porter is higher
than that of the original stemmer. This is a consequence of New Porter removing a large number
of suffixes, each affecting a smaller number of words: rules that improve the UI value tend to
hurt OI and generate more overstemming errors.
For a more detailed analysis, we examined the structure of Word List A and Word List B. We
found that a significant number of words relate to derivational morphology. Porter ignores many
derivational suffixes such as –est, –ship, –ist, –tor, –ionally, –antly, –atory, etc. None of the
words ending in these suffixes are conflated with their related forms by the original version of
the stemmer. This problem is resolved in the proposed version, which handles derivations and
inflections very well.
The ERRT is the general measure used by Paice [5] to evaluate the accuracy of a stemmer. By
this measure, the best stemmer is the one with the lowest ERRT value. So, taking ERRT as a
general indicator of accuracy, we conclude that New Porter is a better stemmer than Porter: the
(OI, UI) point of New Porter is further from the truncation line than that of Porter, and
consequently the Porter stemmer generates more errors than the New Porter stemmer. We also
note a large improvement in the ERRT values: about 27% for Word List A and about 43% for
Word List B. This means that the proposed approach performs much better than the original
version.
Comparing our stemmer to the other approaches (the Lovins and Paice/Husk stemmers), we find
that the new stemmer not only performs better than the original version but is also more accurate
than the Paice/Husk and Lovins stemmers: the differences in ERRT values are substantial (about
66% with Paice/Husk and 95% with Lovins). Regarding stemmer strength, New Porter is lighter
than the Paice/Husk stemmer, since it has a lower SW value. This is beneficial for information
retrieval, since it improves precision: less useless information is retrieved.
5.3 Evaluation in Information Retrieval
The new stemming algorithm described in the previous section was also evaluated in the textual
retrieval field. We implemented a traditional document retrieval system that retrieves a list of
documents ranked by their relevance to a query. The system is based on the VSM [7] and tf-idf
weighting [8], [9].
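The VSM with tf-idf weighting can be sketched as follows. This is our simplified illustration of the standard scheme, not the authors' system: raw term frequency times log inverse document frequency, with documents ranked by cosine similarity; in the actual pipeline, tokens would first pass through the stemmer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse tf-idf vector
    (dict term -> weight) per document."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    # cosine similarity between two sparse vectors
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A query is vectorized the same way and documents are returned sorted by decreasing cosine similarity to the query vector.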
To evaluate our system we used a corpus of 400 documents from MEDLINE (Medical Literature
Analysis and Retrieval System Online). MEDLINE is a bibliographic database of the National
Library of Medicine, containing more than 19 million bibliographic articles, accessible via
PubMed [10].
For the comparison, we ran the information retrieval method first without a stemmer, then with
the original Porter stemmer, and finally with the new Porter stemmer. We used the two
well-known metrics, recall and precision, calculated as follows:

recall = correct retrieved / correct
precision = correct retrieved / retrieved
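These two metrics can be computed directly from the sets of retrieved and relevant documents; a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    # precision = |retrieved ∩ relevant| / |retrieved|
    # recall    = |retrieved ∩ relevant| / |relevant|
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, retrieving four documents of which two are relevant, out of three relevant documents overall, gives precision 0.5 and recall 2/3.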
A good information retrieval system should retrieve many relevant documents (high recall) while
retrieving few non-relevant documents (high precision). As shown in Table 8, on the same
evaluation documents, the document retrieval system performed better with the Porter stemmer
than without any stemmer.
Since the stem of a term represents a broader concept than the original term, the stemming
process increases the number of retrieved documents. When the document retrieval system uses
the new Porter stemmer, we observe a further improvement in retrieval effectiveness compared to
the original Porter stemmer.
Table 8. Results in information retrieval
Used stemmer in IR Precision Recall
Without Stemmer 0.661 0.671
With original Porter Stemmer 0.732 0.775
With new Porter Stemmer 0.852 0.884
6. CONCLUSION
This paper proposes an improved version of the Porter stemmer for English. The stemmer was
evaluated using the Paice evaluation method. In these experiments, we used two grouped word
lists of 10,000 and 20,000 words respectively. Promising results were obtained: the New Porter
stemmer performs much better than the original stemmer and other English stemmers.
The new stemmer was also evaluated within an information retrieval system. The results obtained
are encouraging: both precision and recall are improved in an information retrieval system using
the new version of the Porter stemmer compared to an information retrieval system using the
original version.
We conclude that the New Porter stemmer is a relatively light stemmer and hence seems
appropriate for the information retrieval task. To further confirm the robustness of the stemming
algorithm, a perspective for this work is to evaluate the New Porter stemmer in other natural
language processing areas. The New Porter stemmer could be used to improve automatic text
summarization, document clustering, information extraction, document indexing,
question-answering systems, etc. The new algorithm can also be useful for word normalization
and for reducing the representation space.
Since implementations of this algorithm are available in diverse languages, we plan to produce
new versions in further languages to process multilingual corpora.
REFERENCES
[1] Lovins, J.B. (1968). Development of a stemming algorithm. Journal of Mechanical Translation
and Computational Linguistics, Vol. 11, No. 1-2, pp. 22-31.
[2] Paice, C.D. (1990). Another stemmer. SIGIR Forum. Vol. 24, No.3, pp.56-61.
[3] Porter, M.F. (1980). An Algorithm for Suffix Stripping. The journal Program, Vol. 14, No. 3, pp.
130-137.
[4] Porter, M.F. (2006). An algorithm for suffix stripping. Program: electronic library and information
systems, Vol. 40 Iss: 3, pp.211 – 218.
[5] Paice, C.D. (1994). An evaluation method for stemming algorithms. In Proceedings of the 7th
Annual Intl, ACM-SIGIR Conference, Dublin, Ireland, pp.42–50.
[6] Paice, C., Hooper, R. (2005). Available at:
http://www.comp.lancs.ac.uk/computing/research/stemming/Links/program.htm
[7] Salton, G., Wong, A., & Yang, C.-S. (1975). A vector space model for automatic indexing.
Communications of the ACM, Vol. 18. No. 11, pp. 613–620.
[8] Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval.
Information Processing and Management, Vol. 24, No. 5, pp. 513–523.
[9] Turney, P.D., Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics
Journal of Artificial Intelligence Research No. 37, pp. 141-188.
[10] NLM: National Library of Medicine, PubMed (June 2012),
http://www.ncbi.nlm.nih.gov/pubmed.
Authors
Wahiba Ben Abdessalem is a professor in the Department of Computer and Information
Science at the University of Tunis, Higher Institute of Management. She received a
Master's degree in 1992 from Paris III, New Sorbonne, France, and a PhD from Paris 7
Jussieu, France, in 1997. Her research interests include natural language processing, document
annotation, information retrieval, and text mining.
She is a member of the program committees of several international conferences (ICCA'2010,
ICCA'2011, ICCA'2013, RFIW 2011, ICCAT'2013, ICCRK'2013, ICMAES'2013, …) and a
member of the editorial boards of several journals, including the International Journal of
Managing Information Technology (IJMIT) and the International Journal of Chaos, Control,
Modelling and Simulation (IJCCMS).