Stemming is the process of clipping affixes from an input word to obtain its root, but the stem produced is not necessarily a genuine, meaningful word. To overcome this problem we use a lemmatizer: a process that carves out the lemma from the given word and can also apply additional rules to turn the clipped string into a proper stem. In this paper we have created an inflectional lemmatizer that generates rules for extracting suffixes and adds further rules for producing a proper, meaningful root word.
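The suffix-stripping-plus-recoding idea described above can be sketched as follows. The suffix list and recoding table here are toy English examples for illustration only, not the rules generated by the paper's lemmatizer:

```python
# Minimal sketch of a suffix-stripping lemmatizer with recoding rules.
# SUFFIX_RULES clips endings; RECODE_RULES repairs clipped stems that are
# not genuine words (the "proper meaningful root word" step).

SUFFIX_RULES = ["ies", "ing", "ed", "es", "s"]      # tried longest-first
RECODE_RULES = {"stud": "study", "carr": "carry"}   # hypothetical fix-ups

def lemmatize(word):
    for suffix in SUFFIX_RULES:
        # Require a reasonable stem length so short words pass through.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            return RECODE_RULES.get(stem, stem)
    return word

print(lemmatize("studies"))  # study
print(lemmatize("walking"))  # walk
```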
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMER (ijnlc)
Morphological stemming is a critical step in natural language processing; its purpose is to reduce alternative word forms to a common morphological root. Word segmentation for Myanmar, as for most Asian languages, is an important and extensively studied sequence-labelling problem. Named entity detection is another task in Asian languages that has traditionally required a large amount of feature engineering to achieve high performance. Integrating these tasks would benefit all of them. In recent years, end-to-end sequence-labelling models based on deep learning have become widely used. This paper introduces a deep BiGRU-CNN-CRF network that jointly learns word segmentation, stemming and named entity recognition. We trained the model on manually annotated corpora. Whereas state-of-the-art named entity recognition systems rely heavily on handcrafted features, our joint model relies on two sources of information: character-level and syllable-level representations.
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI (ijnlc)
Machine transliteration has emerged as an important research area within machine translation. Transliteration aims to preserve the phonological structure of words, and proper transliteration of named entities plays a significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed rules for syllabification, the process of separating the syllables of a word. We calculate probabilities for named entities (proper names and locations); for words that do not fall into this category, separate probabilities are calculated from relative frequencies using the statistical machine translation toolkit MOSES. Using these probabilities we transliterate the input text from English to Punjabi.
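The relative-frequency step above can be sketched as follows. The aligned syllable pairs are hypothetical; a real system would estimate them from a parallel name list or from MOSES phrase tables:

```python
from collections import Counter, defaultdict

# Hypothetical aligned (English syllable, Punjabi syllable) pairs.
pairs = [("man", "ਮਨ"), ("man", "ਮਾਨ"), ("man", "ਮਨ"), ("deep", "ਦੀਪ")]

counts = defaultdict(Counter)
for src, tgt in pairs:
    counts[src][tgt] += 1

def transliterate_syllable(syl):
    # Relative frequency: P(tgt | src) = count(src, tgt) / count(src).
    options = counts.get(syl)
    if not options:
        return syl, 0.0          # back off: leave unknown syllables unchanged
    total = sum(options.values())
    best, n = options.most_common(1)[0]
    return best, n / total

print(transliterate_syllable("man"))   # picks the most probable mapping
```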
ISOLATING WORD LEVEL RULES IN TAMIL LANGUAGE FOR EFFICIENT DEVELOPMENT OF LAN... (ijnlc)
With the advent of social media, the amount of text available for processing across different natural languages has become enormous, and the past few decades have seen a tremendous increase in the number of language processing applications. Tools for natural language computing differ greatly across languages because each language has its own set of grammatical rules. This paper focuses on identifying the basic inflectional principles of the Tamil language at the word level. Three levels of word-inflection concepts are considered: Patterns, Rules and Exceptions. The focus of this paper is how grammatical principles for word inflection in Tamil can be grouped into these three levels and applied to obtain different word forms. These can be used in a wide variety of natural language applications such as morphological analysis, morphological generation, word-level translation, spelling and grammar checking, and information extraction. Tools using these rules allow faster operation and better implementation in NLP applications of the Tamil grammatical rules drawn from [தொல்காப்பியம் | tholgaappiyam] and [நன்னூல் | nannool].
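The three-level Patterns/Rules/Exceptions organisation can be sketched as a layered lookup. The entries below are hypothetical English examples, not actual Tamil data:

```python
# Layered inflection: exceptions override rules; a rule instantiates a
# general pattern (here, a suffix-changing function).

EXCEPTIONS = {("child", "plural"): "children"}
RULES = {"plural": lambda w: w[:-1] + "ies" if w.endswith("y") else w + "s"}

def inflect(word, feature):
    if (word, feature) in EXCEPTIONS:      # level 3: exceptions win
        return EXCEPTIONS[(word, feature)]
    if feature in RULES:                   # level 2: rules
        return RULES[feature](word)        # level 1: the underlying pattern
    return word

print(inflect("child", "plural"))  # children
print(inflect("city", "plural"))   # cities
```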
An Implementation of Apertium Based Assamese Morphological Analyzer (ijnlc)
Morphological analysis is an important branch of linguistics for any natural language processing technology. Morphology studies the structure and formation of the words of a language. In the current NLP research scenario, morphological analysis techniques are becoming ever more popular: to process any language, the morphology of its words must first be analysed. The Assamese language has a very complex morphological structure. In our work we have used Apertium-based finite-state transducers to develop a morphological analyzer for Assamese over a limited domain, and we obtain 72.7% accuracy.
In this paper we discuss the difficulties in processing Malayalam text for statistical machine translation (SMT), especially the verb forms. The agglutinative nature of Malayalam is the main issue in processing the text. We focus on verbs and their contribution to the difficulty of processing, since the verb plays a crucial role in defining sentence structure. We illustrate the issues with the existing Google translation system and with a MOSES system trained on a limited English-Malayalam parallel corpus. Our reference for analysis is the English-Malayalam language pair.
Sanskrit in Natural Language Processing (Hitesh Joshi)
Sanskrit is the most unambiguous language compared to other natural languages. As stated by Rick Briggs of NASA, it is the most suitable language for the computer in natural language processing.
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ... (Waqas Tariq)
The words of a sentence do not follow the same ordering in different languages. This paper proposes parts-of-speech (POS) based rules for reordering a given English sentence to obtain its translation in Telugu. The added rules for adverbs and exceptional conjunctions, together with improved handling of inflections, enable the system to achieve more accurate translation. The proposed rules, combined with the existing system, gave a score of 0.6190 on the BLEU evaluation metric when translating sentences from English to Telugu. The paper handles the simple form of sentences particularly well.
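One core POS-based reordering rule can be sketched as follows: English is SVO while Telugu is SOV, so verbs are moved after their objects. The tags and example sentence are hypothetical; a real system uses a full POS tagger and richer rules:

```python
# Illustrative SVO -> SOV reordering over POS-tagged tokens.

def reorder_svo_to_sov(tagged):
    """tagged: list of (word, tag) with Penn-style tags like NN, VBZ, DT."""
    out, verbs = [], []
    for word, tag in tagged:
        if tag.startswith("VB"):
            verbs.append((word, tag))   # hold verbs back
        else:
            out.append((word, tag))
    return out + verbs                  # verbs move to the end (SOV order)

sent = [("Rama", "NN"), ("reads", "VBZ"), ("a", "DT"), ("book", "NN")]
print([w for w, _ in reorder_svo_to_sov(sent)])  # ['Rama', 'a', 'book', 'reads']
```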
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most natural language processing tasks, especially for derivational languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots can be extracted from the same word. Distributional semantics, on the other hand, is a powerful co-occurrence model: it captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, an improvement of at least 9.4% over other stemmers.
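A smoothed PMI over co-occurrence counts can be sketched as below. Add-k smoothing is one of several ways to smooth PMI; the paper's exact SPMI formulation may differ, and the toy counts are purely illustrative:

```python
import math
from collections import Counter

# Toy (word, context) co-occurrence counts.
cooc = Counter({("kitab", "read"): 8, ("kitab", "write"): 2, ("qalam", "write"): 6})
k = 0.5  # add-k smoothing constant

words = {w for w, _ in cooc}
ctxs = {c for _, c in cooc}
total = sum(cooc.values()) + k * len(words) * len(ctxs)

def p(w=None, c=None):
    # Smoothed joint and marginal probabilities.
    if w is not None and c is not None:
        return (cooc[(w, c)] + k) / total
    if w is not None:
        return sum(p(w, c2) for c2 in ctxs)
    return sum(p(w2, c) for w2 in words)

def spmi(w, c):
    # SPMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ) with smoothed counts.
    return math.log2(p(w, c) / (p(w=w) * p(c=c)))

print(round(spmi("kitab", "read"), 3))   # positive: strong association
print(round(spmi("kitab", "write"), 3))  # negative: weak association
```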
Anaphora resolution is the process of finding referents in discourse. In computational linguistics it is a complex and challenging task. This paper focuses on pronominal anaphora resolution, the subpart of anaphora resolution in which pronouns are linked to noun referents. Incorporating anaphora resolution into applications such as automatic summarization, opinion mining, machine translation and question answering systems increases their accuracy by 10%. Related work in this field has been done in many languages; this paper focuses on resolving anaphora for the Punjabi language. A model is proposed for resolving anaphora, and an experiment is conducted to measure the accuracy of the system. The model uses two factors: recency and animistic knowledge. The recency factor works on the concept of the Lappin and Leass approach, and animistic knowledge is introduced using the gazetteer method. The experiment is conducted on a Punjabi story of more than 1000 words, and results are presented along with future directions.
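The two factors can be combined as sketched below: in the spirit of Lappin and Leass, candidate salience decays with sentence distance, and a gazetteer of animate nouns filters candidates for personal pronouns. The gazetteer entries and decay factor are hypothetical:

```python
# Simplified recency + animacy scoring for pronominal anaphora resolution.

ANIMATE = {"Jasleen", "teacher"}   # hypothetical animacy gazetteer

def resolve(pronoun_sentence_idx, candidates):
    """candidates: list of (noun, sentence_index) preceding the pronoun."""
    best, best_score = None, float("-inf")
    for noun, idx in candidates:
        if noun not in ANIMATE:    # a pronoun like 'he/she' needs animacy
            continue
        score = 0.5 ** (pronoun_sentence_idx - idx)   # recency decay
        if score > best_score:
            best, best_score = noun, score
    return best

# 'book' is filtered by animacy; 'teacher' is more recent than 'Jasleen'.
print(resolve(3, [("Jasleen", 1), ("teacher", 2), ("book", 3)]))  # teacher
```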
Myanmar word sorting is very important for search-engine indexing, to optimize the keyword-search process. This paper proposes an efficient sorting algorithm for Myanmar words based on the weights of the consonants, vowels, devowelizers and consonant combinations of each syllable, since Myanmar words are composed of one or more syllables; the words are then sorted with Quicksort. The proposed algorithm is designed for the Zawgyi-One font, which is dominant on Myanmar web pages.
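Weight-based sorting of this kind can be sketched as follows: each syllable maps to a tuple of numeric weights, and words compare by their syllable-weight sequences. The ASCII stand-in syllables and weight tables are purely illustrative, and Python's built-in sort stands in for the paper's Quicksort:

```python
# Sort words by per-syllable (consonant weight, vowel weight) tuples.

CONSONANT_WEIGHT = {"k": 1, "kh": 2, "g": 3, "ng": 4}
VOWEL_WEIGHT = {"a": 1, "i": 2, "u": 3}

def syllable_key(syl):
    # Match the longest consonant first; the remainder is the vowel.
    for cons in sorted(CONSONANT_WEIGHT, key=len, reverse=True):
        if syl.startswith(cons):
            return (CONSONANT_WEIGHT[cons], VOWEL_WEIGHT.get(syl[len(cons):], 0))
    return (0, 0)

def word_key(word):
    # Words are syllable sequences; '-' marks syllable boundaries here.
    return tuple(syllable_key(s) for s in word.split("-"))

words = ["kha-ka", "ka", "ga-ngi"]
print(sorted(words, key=word_key))  # ['ka', 'kha-ka', 'ga-ngi']
```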
Anaphora Resolution in Hindi Language Using Gazetteer Method (ijcsa)
Anaphora resolution is one of the active research areas within natural language processing, and the resolution of anaphoric references is one of its most challenging and complex tasks. This paper focuses entirely on pronominal anaphora resolution for the Hindi language. Among the various methodologies for resolving anaphora, this paper presents a computational model for Hindi based on the gazetteer method: lists are created, and operations are then applied to classify the elements present in each list. Of the many salient factors for resolving anaphora, the proposed model uses two: animistic knowledge and recency. The animistic factor distinguishes living things from non-living things, while recency reflects the fact that referents mentioned in the current sentence tend to carry higher weights than those in previous sentences. The paper describes experiments conducted on short Hindi stories, news articles and biographical content from Wikipedia, their results, and future directions for improving accuracy.
Abstract: The use of regular expressions to search text is a well known and useful technique. Regular expressions are generic representations of a string or a collection of strings, and regexps are among the most useful tools in computer science. NLP, as an area of computer science, has greatly benefited from regexps: they are used in phonology, morphology, text analysis, information extraction and speech recognition. This paper gives the reader a general review of the usage of regular expressions, illustrated with examples from natural language processing, and discusses different approaches to regular expressions in NLP. Keywords: Regular Expression, Natural Language Processing, Tokenization, Longest common subsequence alignment, POS tagging
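A typical NLP use of regexps is tokenization. The sketch below is a rough word tokenizer that keeps contractions, hyphenated words and decimal numbers as single tokens; the pattern is one reasonable choice, not the paper's:

```python
import re

# One alternation per token type: words (with internal hyphen/apostrophe),
# numbers (optionally decimal), then any single punctuation character.
TOKEN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Regex-based tokenizers aren't perfect, but 3.14 works!"))
```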
----------------------------
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH (ijnlc)
Machine translation has been pursued by researchers for quite a long time. However, a flawless machine translator remains a dream, and only a small number of researchers have focused on translating Marathi text to English. Perfect machine translation systems have not yet been built, owing to the fact that languages differ syntactically as well as morphologically. While the majority of researchers have opted for statistical machine translation, in this paper we address the challenges of rule-based machine translation. The paper describes the major divergences observed between Marathi and English, the many challenges encountered while attempting to build a rule-based machine translation system from Marathi to English, and the rules for handling these challenges. Since the rules have exceptions and there are limits to the feasibility of maintaining the knowledge base, practical machine translation from Marathi to English is a complex task.
Artificially Generated Concatenative Syllable based Text to Speech Synthesi... (iosrjce)
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor (Waqas Tariq)
In recent decades, speech-interactive systems have gained increasing importance. The performance of an ASR system depends mainly on the availability of a large speech corpus. The conventional method of building a large-vocabulary speech recognizer for any language takes a top-down approach: it requires a large speech corpus with sentence- or phoneme-level transcriptions of the utterances, and the transcriptions must cover enough variety of speech for the recognizer to build models of all the sounds present. For Telugu, however, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and similar languages) is handled within morphology in Telugu; phrases spanning several words (tokens) in English map onto a single word in Telugu. Telugu is also phonetic in nature in addition to being morphologically rich, which is why speech technology developed for English cannot be applied directly to it. This paper describes work carried out to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is a recognition-enhancement process we developed for highly inflecting, morphologically rich languages. This method increases speech recognition accuracy with a greatly reduced corpus size, and it dynamically adds Telugu words to the database, allowing the corpus to grow.
This paper presents an unsupervised approach for the development of a stemmer for the Urdu and Marathi languages. During the last few years, a wide range of information in Indian regional languages has been made available on the web in the form of e-data, but access to these data repositories remains low because efficient search engines and retrieval systems supporting these languages are very limited. Automatic information processing and retrieval has therefore become an urgent requirement. To train the system, training datasets taken from CRULP [22] and a Marathi corpus [23] are used. For generating suffix rules, two different approaches, namely frequency based stripping and length based stripping, have been proposed. The evaluation has been made on 1200 words extracted from the EMILLE corpus. The experimental results show that for Urdu the frequency based suffix generation approach gives a maximum accuracy of 85.36%, whereas the length based suffix stripping algorithm gives a maximum accuracy of 79.76%. For Marathi the system gives 63.5% accuracy with frequency based stripping and achieves a maximum accuracy of 82.5% with the length based suffix stripping algorithm.
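The two suffix-generation strategies can be sketched roughly as follows. This is a minimal illustration of frequency based generation and length based (longest-match) stripping in general; the function names, the frequency threshold, and the toy word list are our own illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def candidate_suffixes(words, max_len=5):
    """Count every word ending of length 1..max_len across the corpus."""
    counts = Counter()
    for w in words:
        for i in range(1, min(max_len, len(w)) + 1):
            counts[w[-i:]] += 1
    return counts

def frequency_based_suffixes(words, min_freq=3):
    """Frequency based generation: keep endings frequent enough to act as suffixes."""
    return {s for s, c in candidate_suffixes(words).items() if c >= min_freq}

def length_based_strip(word, suffixes, min_stem=2):
    """Length based stripping: remove the longest matching suffix,
    provided a minimal stem remains."""
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            return word[:-len(s)]
    return word
```

For a real Urdu or Marathi corpus, the generated candidate list would then be filtered manually or against a gold list of stems before being used as stripping rules.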
Word morphology is the process of analysing word formation, and morphological analysis is one of the pre-processing steps in natural language processing tasks. Few studies have looked at Setswana noun morphology analysis and generation computationally. In this paper we present a rule-based Setswana noun morphological analyser and generator. The analyser and generator implement morphological rules supported by a dictionary of root words with some attributes. Results show that Setswana nouns can mostly be analysed using morphological rules, and the rules can also be used to generate other words. Adjectives, pronouns, adverbs and enumeratives are also included. The generator shows that Setswana nouns, adjectives and adverbs are less productive compared to verbs. The analyser achieves a 79% performance rate and the generator 92%. The analyser rules fail when multiple words share the same intermediate word and with homographs; the generator failures are due to over-generation and under-generation.
Quality estimation of machine translation outputs through stemming
Machine Translation is a challenging problem for Indian languages. Every day we see new machine translators being developed, but high quality automatic translation is still a very distant dream, and correctly translated sentences for Hindi are rarely found. In this paper we emphasize the English-Hindi language pair; in order to select the correct MT output we present a ranking system which employs machine learning techniques and morphological features. No human intervention is required in the ranking. We have also validated our results by comparing them with human ranking.
Improving a Lightweight Stemmer for Gujarati Language
Stemming lies at the root of text mining. It is used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM), including Text Categorization (TC) and Text Summarization (TS). Building an effective stemmer for Gujarati has always been a hot research domain, since Gujarati has a structure very different from, and more difficult than, other languages due to its rich morphology.
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
Cyberbullying is currently one of the most important research fields. The majority of researchers have contributed to bully-text identification in English texts or comments; due to the scarcity of data, analyzing and stemming Tamil text is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, and the creation of a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, we propose the rule-based iterative preprocessing algorithm (RBIPA). In this attempt, Tamil morphemes and lemmas were extracted using the suffix stripping technique, and a supervised machine learning algorithm classifies words as pronouns or proper nouns. The novelty of the proposed system is a preprocessing algorithm for iterative stemming and lemmatization that discovers the exact words from Tamil-language comments. RBIPA shows 84.96% accuracy on the given test dataset, which has a total of 13000 words.
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
Machine Translation for Indian languages is an emerging research area. Transliteration is one of the modules designed while building a translation system; it means mapping source language text into the target language. Simple mapping decreases the efficiency of the overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration: the effectiveness of translation can be improved if transliteration is assisted by part-of-speech tagging and stemming. We show that much of the content in Gujarati gets transliterated while being processed for translation to Hindi.
ISOLATING WORD LEVEL RULES IN TAMIL LANGUAGE FOR EFFICIENT DEVELOPMENT OF LAN...
With the advent of social media, the amount of text available for processing across different natural languages has become enormous. In the past few decades, there has been a tremendous increase in the number of language processing applications. The tools for natural language computing differ greatly across languages because each language has its own set of grammatical rules. This paper focuses on identifying the basic inflectional principles of the Tamil language at word level. Three levels of word inflection concepts are considered: patterns, rules and exceptions. The focus of this paper is how grammatical principles for word inflections in Tamil can be grouped into these three levels and applied to obtain different word forms. These can be used in a wide variety of natural language applications such as morphological analysis, morphological generation, word level translation, spelling and grammar checking, information extraction, etc. Tools using these rules will allow faster operation and better implementation of Tamil grammatical rules, referred from [தொல்காப்பியம் | tholgaappiyam] and [நன்னூல் | nannool], in NLP applications.
Designing A Rule Based Stemming Algorithm for Kambaata Language Text
Stemming is the process of reducing inflectional and derivational variants of a word to its stem. It has substantial importance in several natural language processing applications. In this research, a rule based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is a single pass, context-sensitive, longest-matching algorithm designed by adapting the rule based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology and with word formation relying mostly on suffixation, even though its word formation also involves infixation, compounding and reduplication.
The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, error counting method was applied. A test set of 2425 distinct words was used to evaluate the stemmer. The output from the stemmer indicates that out of 2425 words, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were over stemmed and 13 words (0.54%) were under stemmed. What is more, a dictionary reduction of 65.86% has also been achieved during evaluation.
The main factor for errors in stemming Kambaata words is the language's rich and complex morphology. Hence a number of errors can be corrected by exploring more rules. However, it is difficult to avoid the errors completely due to complex morphology that makes use of concatenated suffixes, irregularities through infixation, compounding, blending, and reduplication of affixes.
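The error counting evaluation reported above (correctly stemmed vs. over-stemmed vs. under-stemmed) can be reproduced with a small helper. The classification rule here — an output longer than the gold stem counts as under-stemming, anything else that differs counts as over-stemming — is our simplifying assumption for illustration, not the paper's exact procedure.

```python
def evaluate_stemmer(stem, gold):
    """Error counting evaluation of a stemmer against a gold map
    of word -> correct stem. An output longer than the gold stem is
    counted as under-stemming; any other mismatch as over-stemming."""
    correct = over = under = 0
    for word, gold_stem in gold.items():
        out = stem(word)
        if out == gold_stem:
            correct += 1
        elif len(out) > len(gold_stem):
            under += 1
        else:
            over += 1
    n = len(gold)
    return {"accuracy": correct / n, "over": over / n, "under": under / n}
```

Applied to a 2425-word test set, the same three counters yield the 96.87% / 2.60% / 0.54% breakdown quoted above.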
The objective of this research is to classify serial-verb constructions in Thai automatically, using word classes from the Thai WordNet to classify the verbs in a sentence. The Thai language has an extend-to-the-right structure and places the adjective after the noun; its overall grammatical characteristic is the "Subject-Verb-Object" (SVO) type. Thai can also chain one verb after another within the same sentence, a construction called a "serial verb". There is already a good deal of research on serial-verb constructions, but none on their automatic classification.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
Stemming is the process of term conflation: it conflates all the variants of a word to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, information retrieval (IR), etc. Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and existing stemmers for Indic languages.
2. 68 Computer Science & Information Technology (CS & IT)
word गल तय, which when stemmed gives गल तय, but this is not a proper root. Over-stemming occurs when words that should not be grouped together are actually grouped; it arises mainly with semantically distant words. For example, the Hindi word चहेता when stemmed gives चहे, which again is not a proper root word. Thus a stemmer does not provide the contextual knowledge that a lemmatizer provides. Since English and other European languages are not highly inflected, many stemmers and lemmatizers have been developed for them; Indian languages, in contrast, are highly inflected, and lemmatizers for them are rarely found. In this paper we emphasize the Hindi language. Hindi is the official language of India and is widely spoken in almost all parts of the country. In order to preserve the language and its root words, we have developed a lemmatizer built around various rules. The study does not include all possible rules, but it can be taken as a prototype for extending the functionality of the system. We have made an attempt to build an automated lemmatizer using these rules; the system can be used efficiently for information retrieval.
The paper is organized as follows: the first section gives the introduction and the problems of stemming. The next section discusses the literature in this research area. We then discuss the linguistic background of Hindi, followed by our proposed work, in which we present the methodology and approach we have used. We then present the processing technique, discuss some examples, and note some limitations of the system based on the evaluation results. The final section concludes the paper.
2. RELATED WORK
A lot of research work has been done, and is still going on, on the development of stemmers and lemmatizers. The first stemmer was developed by Julie Beth Lovins [1] in 1968. It was later improved upon by Martin Porter [2] in July 1980 for the English language; the Porter algorithm is one of the most widely accepted methods for stemming, in which affixes are automatically removed from English words, and it was originally implemented as a program in BCPL. Much work has been done on lemmatizers for English and other European languages. In contrast, very little work has been done on lemmatization for Indian languages.
The rule based approach proposed by Plisson et al. [3] is one of the most accepted lemmatizing algorithms. It is based on word endings, where a suffix is removed or added to obtain the normalized form, and it emphasizes two word lemmatization algorithms based on if-then rules and the ripple-down approach. The work proposed by Goyal et al. [6] focuses on the development of a morphological analyzer and generator, aimed at a translation system from Hindi to Punjabi. Nikhil K V S [8] built a Hindi derivational analyzer using a specific tool, following a supervised approach with an SVM classifier. Jena et al. [9] proposed a morphological analyzer for the Oriya language using the paradigm approach; they classified Oriya nouns, adjectives and finite verbs using various paradigm tables. Anand Kumar et al. [7] developed an automatic system for the analysis of Tamil morphology, combining a rule based approach with sequence labeling that captures the non-linear relationships among morphological features in the training data.
Chachoo et al. [5] used an extraction tool named Extract v2.0 for the development of the orthographic component of the Kashmiri script. Majumder et al. [12] proposed a method in which a clustering based approach is used to discover equivalence classes of root words; the algorithm was tested on two languages, French and Bangla. A rule based approach to stemming in Hindi was proposed by Ramanathan & Rao [11], based on stripping off suffixes by generating rules for noun, adjective and verb inflections in Hindi. Bharti Akshar et al. [4] presented work on natural language processing with a detailed study of morphology using the paradigm approach.
3. LINGUISTIC BACKGROUND OF HINDI
Morphemes play a vital role in lemmatization; they are the primary units through which morphologists investigate words, studying their formation and internal structure in depth. Morphology is broadly categorized into two parts: derivational morphology and inflectional morphology. Derivational morphology forms new lexemes from existing ones by adding or deleting affixes. For example, सजा + वट = सजावट, where the word class changes from adjective to noun. Similarly, in English we have words like employ + ee = employee, where the word class changes from verb to noun. Inflectional morphology produces various inflections of a word without changing the word class. For example, कलम + दान = कलमदान, where both कलम and कलमदान are singular nouns; the word class remains the same. The root forms of words basically fall under the noun and verb classes. This knowledge leads us to the paradigm approach. According to Smriti Singh and Vaijayanthi M Sarma [10], the Hindi noun classification system shows only number and case for morphological analysis. Number is either singular or plural; by default a word is treated as singular. In Hindi there are two types of case, direct and oblique; oblique words show both the case and the number of the word. For example, in लड़क – ◌ा and लड़क – ◌े, the maatra ◌ा indicates singular number whereas ◌े indicates plural number. Similarly, there are gender rules: in Hindi, words ending with the suffix ◌ी are typically feminine whereas words ending with the suffix ◌ा are typically masculine. For example, लड़का, नेता, घोड़ा, कटोरा, बच्चा and many more are masculine words ending in ◌ा, while लड़की, धोबी, पुत्री, कटोरी, बच्ची and many more are feminine words ending in ◌ी. But many words contradict this pattern: the word पानी (water) is masculine although it ends with ◌ी, and the word माला (garland) is feminine even though it ends with ◌ा. There are also words from which the suffix must not be removed. For example, for the suffix ◌ा, the words पिता, माता, बच्चा, कटोरा, नेता, and many more do not require stemming; such words need to be maintained as they are and should be refrained from being stemmed. Thus Hindi is a highly inflected language that requires a deep study of word structure and formation.
4. PROPOSED WORK
In this paper we discuss the creation of a Hindi lemmatizer. The approach is based on the key concept of optimization, which covers both space and time, and our design is driven by these two parameters. The lemmatizer discussed here mainly addresses the time complexity problem. Typically a lemmatizer is built using a rule based approach and a paradigm approach. In the rule based approach, along with the rules, a knowledgebase is created for storing the grammatical features. A knowledgebase is also created for storing the exceptional root words, i.e. root words that must be kept as they are, suffix included. Although creating the knowledgebase requires a large amount of memory, with respect to time it gives the best, most accurate and fastest results, because very little time is needed to look the input word up in the knowledgebase. The study in [7] shows that Tamil words have an infinite set of inflections, whereas Hindi words have a finite set of inflections, which is quite easy to maintain in a knowledgebase. We have restricted our knowledgebase to commonly used words and excluded proper nouns such as the names of persons and places.
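The lookup-first design described above can be sketched as follows; the knowledgebase entries and the rule list here are illustrative stand-ins, not the paper's actual data.

```python
# Exception knowledgebase: words that are already roots and must not be
# stripped even though they end in a common suffix (entries illustrative).
KNOWN_ROOTS = {"पानी", "माला", "नेता", "पिता", "माता"}

def lemmatize(word, rules):
    """Look the word up in the knowledgebase first; only if it is absent
    are the suffix stripping rules tried in order."""
    if word in KNOWN_ROOTS:  # O(1) set lookup gives the fast retrieval
        return word
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word
```

The set lookup is what makes the memory/time trade-off pay off: exceptional words never reach the rules, so no rule can mis-strip them.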
4.1 SUFFIX GENERATION
We examined various words together with their suffixes and the associated morphological changes for the development of the lemmatizer; these suffixes and changes led to the formulation of specific rules. For example, the word खराबी (defect) is derived by adding the suffix ◌ी to the word खराब (bad), which transforms the adjective into a noun. There are many other words with the same suffix; some of them are shown in tables 1 and 2.
Table 1. Examples of derived words with the suffixes ◌ी and ई (adjective to noun)

Root Word     Derived Word
साफ़            सफ़ाई
ऊँचा           ऊँचाई
मोटा           मोटाई
गरीब          गरीबी
सर्द            सर्दी
Table 2. Some more suffixes

Root Word     Derived Word     Suffix
गाड़ी           गाड़ियों            ि◌य
मीठा           मिठाई             ई
पवित्र          पवित्रता           ता
जादू           जादूगर            गर
रोशन          रोशनदान           दान
चढ़            चढ़ाई              ◌ाई
चढ़ चढ़ाई ◌ाई
Hindi has a large number of words, and for this reason the extracted suffix list was long. Since the work was done manually, this phase was quite time consuming. The suffixes were generated by processing a corpus of 40,000 sentences, from which 75 lakh (7.5 million) words were manually stemmed and 124 suffixes were derived.
4.2 RULES GENERATION
After generating the suffix list, we developed the rules. We created 124 rules, framed in such a
way that the suffix is removed from the input word and, where required, a character or 'maatra'
(vowel sign) is added. For example, let us take the suffix ों. Some of the words containing this
suffix are shown in Table 3.
Table 3. Words showing the suffix ों

Word      Root     Extraction of suffix    Addition of character
लड़कों     लड़का     ों                      ◌ा
सड़कों     सड़क      ों                      __
लेखकों     लेखक     ों                      __
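The rule in Table 3 can be sketched as follows (a minimal illustration; the dictionary of additions and the function name are ours, not the paper's):

```python
# exceptional words that need a character added after the suffix is stripped
ADDITIONS = {"लड़कों": "ा"}

def apply_on_rule(word: str) -> str:
    """Strip the suffix ों and add back a character where required."""
    suffix = "ों"
    if not word.endswith(suffix):
        return word
    return word[: -len(suffix)] + ADDITIONS.get(word, "")

print(apply_on_rule("लड़कों"))  # लड़का
print(apply_on_rule("सड़कों"))  # सड़क
```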
In the above table, removing the suffix ों yields the respective root words, but the word लड़कों is
an exception: after removing the suffix ों we must add ◌ा to the last letter of the word to obtain
the genuine root word लड़का. Similarly, there are many other rules for removing a suffix, with a
character added where necessary. For instance, there is a rule for extracting the suffix ियों, shown
in Table 4. Hindi grammar requires that when this plural suffix is removed, ◌ी be added to the
last letter of the word. Table 4 gives the general rule for removing the suffix and adding ◌ी, but
there are exceptions in which ि is added instead of ◌ी. The table also shows an exception in the
last word, चिड़ियों, whose root form is चिड़िया. The word चिड़ियों contains two suffixes together,
िय and ों, which makes it hard for the system to pick the correct rule for this word. There are
many more such exceptions, for which we have generated different rules. To handle such cases
we have built a database in which the exceptional words are kept. Although this work requires
much time, the approach is applied for the sake of fast and accurate results.
Table 4. Words showing the suffix ियों

Word        Root      Extraction of suffix    Addition of character
लड़कियों     लड़की      ियों                    ◌ी
कहानियों     कहानी     ियों                    ◌ी
कवियों       कवि       ियों                    ि (exception)
चिड़ियों      चिड़िया    ियों                    िया (exception)
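The exception database can be sketched as a direct lookup tried before the general rule (the contents here are illustrative, drawn from Table 4; the function name is ours):

```python
# exceptional words are stored directly with their roots
EXCEPTIONS = {"कवियों": "कवि", "चिड़ियों": "चिड़िया"}

SUFFIX = "ियों"

def lemmatize_iyon(word: str) -> str:
    if word in EXCEPTIONS:          # exception database is consulted first
        return EXCEPTIONS[word]
    if word.endswith(SUFFIX):       # general rule: strip ियों, add ◌ी
        return word[: -len(SUFFIX)] + "ी"
    return word

print(lemmatize_iyon("लड़कियों"))  # लड़की
print(lemmatize_iyon("चिड़ियों"))  # चिड़िया
```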
5. PROCESSING TECHNIQUE
Fig 1. Schematic diagram of the system
The foremost step is to read the input word. The database contains all the root words. The input
word is first looked up in the database; if it is present, the word is displayed as it is. If the word is
not present in the database, the system falls back to the rules. After applying the rules, the root
word is generated and displayed. The procedure is as follows:
if (word) present in (root list)
{
    fetch the root from the list;
    display the root;
}
else
{
    if (word) ends with (suffix)
    {
        strip the suffix from the word;
        display the root;
    }
}
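The flow above can be sketched in Python. The root list and rule table below are tiny illustrative samples, not the paper's actual knowledgebase or 124-rule set; note that longer suffixes must be tried before shorter ones so that ियों is not mistaken for ों:

```python
ROOT_LIST = {"भारत", "लड़का", "सड़क"}                # knowledgebase of root words
RULES = [("ियों", "ी"), ("ों", ""), ("ीयता", "")]   # (suffix, character to add back)

def lemmatize(word: str) -> str:
    if word in ROOT_LIST:               # word is already a root: display as is
        return word
    for suffix, addition in RULES:      # first matching rule wins
        if word.endswith(suffix):
            return word[: -len(suffix)] + addition
    return word                         # no rule matched

print(lemmatize("भारतीयता"))  # भारत
print(lemmatize("सड़कों"))     # सड़क
```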
6. ILLUSTRATION
As an illustration, we gave the system a set of Hindi words and analyzed the output. Some of them are shown in Table 5.
Table 5. Stemmed output

Input         Output
चिड़ियों       चिड़िया
लड़कियों       लड़की
भारतीयता      भारत
प्रतिभाशाली    प्रतिभा
गौरवान्वित      गौरव
7. EVALUATION
The system was evaluated for accuracy on a set of 2,500 words. Of these, 2,227 words were
lemmatized correctly and 273 were incorrect because they violated both the exceptional and the
general rules.
Fig. 2. Graphical representation of the evaluation
Our system gave 89.08% accuracy, calculated with the following formula:
Accuracy (%) = (Total correct lemmas / Total words) × 100
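The reported figure follows directly from the counts above:

```python
correct, total = 2227, 2500
accuracy = correct / total * 100      # percentage of correctly lemmatized words
print(f"Accuracy = {accuracy:.2f}%")  # Accuracy = 89.08%
```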
8. CONCLUSION
In this paper we have discussed the development of a Hindi lemmatizer. The work follows a
rule-based approach combined with a paradigm approach, in which we created a knowledgebase
containing the Hindi root words commonly used in day-to-day life. The main emphasis is on the
time-optimization problem rather than on space: since storage is no longer a serious constraint,
our approach aims to optimize time and generate accurate results in a very short period. Our
system gave 89.08% accuracy.
REFERENCES
[1] Julie Beth Lovins, Development of a Stemming Algorithm, Mechanical Translation and Computational
Linguistics, Vol. 11, No. 1, pp. 22-31, 1968.
[2] Martin F. Porter, An algorithm for suffix stripping, Program, Vol. 14, No. 3, pp 130-137,1980.
[3] Plisson, J., Lavrač, N., Mladenić, D., A Rule-based Approach to Word Lemmatization, Proceedings of
the 7th International Multiconference Information Society, IS-2004, Institut Jožef Stefan, Ljubljana, pp.
83-86, 2004.
[4] Bharti Akshar, Vineet Chaitanya, Rajeev Sangal, Natural Language Processing: A Paninian
Perspective. Prentice-Hall of India, 1995.
[5] Manzoor Ahmad Chachoo, S.M.K. Quadri. Morphological Analysis from the Raw Kashmiri Corpus
Using Open Source Extract Tool, Vol. 7, No. 2, 2011.
[6] Vishal Goyal, Gurpreet Singh Lehal, Hindi Morphological Analyzer and Generator, IEEE Computer
Society Press, California, USA pp. 1156-1159, 2008.
[7] Anand Kumar M, Dhanalakshmi V, Soman K P, A Sequence Labeling Approach to Morphological
Analyzer for Tamil Language, International Journal on Computer Science and Engineering, Vol. 2, No.
6, 2010.
[8] Nikhil K V S, Hindi derivational morphological analyzer, Language Technologies Research Center,
IIIT Hyderabad, 2012.
[9] Itisree Jena, Sriram Chaudhary, Himani Chaudhary, Dipti M. Sharma, Developing Oriya
Morphological Analyzer Using Lt-toolbox, ICISIL 2011, CCIS 139, pp. 124-129, 2011.
[10] Smriti Singh, Vaijayanthi M Sarma. Hindi Noun Inflection and Distributed Morphology.
[11] A. Ramanathan and D.D Rao, A Light Weight Stemmer for Hindi, In Proceedings of Workshop on
Computational Linguistics for South Asian Languages, 10th Conference of the European Chapter of
Association of Computational Linguistics, pp. 42-48, 2003.
[12] Prasenjit Majumder, Mandar Mitra, Swapan K. Parui, Gobinda Kole, Pabitra Mitra and Kalyankumar
Datta, YASS: Yet Another Suffix Stripper, ACM Transactions on Information Systems, Vol. 25,
No. 4, pp. 18-38, 2007.