Stemming is the process of clipping affixes from an input word to obtain its root, but stemming does not always yield a genuine, meaningful root word. To overcome this problem we turn to a lemmatizer: a process that carves out the lemma from a given word and can apply additional rules to turn the clipped form into a proper stem. In this paper we present an inflectional lemmatizer that generates rules for extracting suffixes and adds rules for producing a proper, meaningful root word.
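The strip-then-repair idea described in that abstract can be sketched as follows; the suffix table and repair rules here are illustrative English stand-ins, not the paper's actual rule set.

```python
# Minimal inflectional lemmatizer sketch: strip a known suffix, then apply
# repair rules so the clipped form becomes a valid root. The rule tables
# below are illustrative stand-ins, not the paper's actual rules.

SUFFIXES = ["ies", "ing", "ed", "es", "s"]          # longest match first
REPAIRS = {"stud": "study", "runn": "run", "carr": "carry"}

def lemmatize(word: str) -> str:
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            stem = word[: -len(suf)]
            return REPAIRS.get(stem, stem)   # repair to a meaningful root
    return word

print(lemmatize("studies"))  # study
print(lemmatize("walked"))   # walk
```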
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMER (ijnlc)
The document describes a deep BiGRU-CNN-CRF neural network model that jointly performs word segmentation, stemming, and named entity recognition for the Myanmar language. The model uses character-level and syllable-level representations as input. It was trained on a manually annotated Myanmar corpus. Evaluation results showed the neural sequence labeling architecture achieved state-of-the-art performance for the joint tasks of word segmentation, stemming, and named entity recognition in Myanmar text.
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI (ijnlc)
Machine transliteration has emerged as a very important research area in the field of machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper transliteration of named entities plays a very significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed rules for syllabification, the process of extracting or separating the syllables of a word. We calculate probabilities for named entities (proper names and locations). For words that do not fall under the category of named entities, separate probabilities are calculated using relative frequency through a statistical machine translation toolkit known as MOSES. Using these probabilities we transliterate the input text from English to Punjabi.
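The relative-frequency estimate mentioned above is the one a MOSES-style phrase table stores: count how often each target form was aligned to a source form and divide by the source total. A minimal sketch, with a hypothetical pair list:

```python
from collections import Counter, defaultdict

# Relative-frequency estimate P(target | source) over aligned syllable pairs,
# the same estimate a phrase table stores. The pair list is hypothetical.
pairs = [("singh", "ਸਿੰਘ"), ("singh", "ਸਿੰਘ"), ("kaur", "ਕੌਰ"),
         ("singh", "ਸਿੰਗ")]

counts = defaultdict(Counter)
for src, tgt in pairs:
    counts[src][tgt] += 1

def p(tgt: str, src: str) -> float:
    total = sum(counts[src].values())
    return counts[src][tgt] / total if total else 0.0

print(round(p("ਸਿੰਘ", "singh"), 2))  # 0.67
```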
This document summarizes current research on morphological analysis techniques for the Assamese language. It discusses prior work using rule-based and unsupervised methods for morphological analysis of several Indian languages, including Hindi, Bengali, Punjabi, Marathi, Tamil, Malayalam, Kannada, and Assamese. For Assamese specifically, it describes several studies that used suffix stripping and rule-based approaches to develop morphological analyzers, as well as some initial work on unsupervised techniques. The document concludes that while most existing work on Assamese has used supervised suffix stripping methods, unsupervised techniques show promise but have not been fully explored.
ISOLATING WORD LEVEL RULES IN TAMIL LANGUAGE FOR EFFICIENT DEVELOPMENT OF LAN... (ijnlc)
With the advent of social media, the amount of text available for processing across different natural
languages has become enormous. In the past few decades, there has been tremendous increase in the
number of language processing applications. The tools for natural language computing of various
languages are very different because each language has its own set of grammatical rules. This paper
focuses on identifying the basic inflectional principles of Tamil language at word level. Three levels of
word inflection concepts are considered – Patterns, Rules and Exceptions. How grammatical principles for
word inflections in Tamil can be grouped in these three levels and applied for obtaining different word
forms is the focus of this paper. These can be made use of in a wide variety of natural language
applications like morphological analysis, morphological generation, word level translation, spelling and
grammar check, information extraction etc. The tools using these rules will account for faster operation
and better implementation of Tamil grammatical rules referred from [தொல்காப்பியம் | tholgaappiyam] and [நன்னூல் | nannool] in NLP applications.
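The three-level lookup (exceptions before rules before general patterns) could be sketched as below; the entries are illustrative English stand-ins, since the paper's actual Tamil rules are not reproduced here.

```python
# Three-level word-inflection lookup in the order the paper describes:
# exceptions first, then rules, then general patterns. All entries here
# are illustrative English stand-ins, not actual Tamil rules.

EXCEPTIONS = {("go", "past"): "went"}
RULES = {("stop", "past"): "stopped"}            # e.g. consonant doubling
PATTERNS = {"past": lambda w: w + "ed"}          # default pattern

def inflect(word: str, feature: str) -> str:
    if (word, feature) in EXCEPTIONS:
        return EXCEPTIONS[(word, feature)]
    if (word, feature) in RULES:
        return RULES[(word, feature)]
    return PATTERNS[feature](word)

print(inflect("go", "past"))    # went
print(inflect("walk", "past"))  # walked
```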
Development of morphological analyzer for Hindi (Made Artha)
This document presents the development of a morphological analyzer for the Hindi language. It discusses the importance of morphological analysis in natural language processing. Hindi is described as a morphologically rich, inflected language capable of generating hundreds of words from root words. The proposed analyzer will take in single Hindi words or sentences and return the root word along with features like part of speech, gender, number, and person. It will use both a rule-based approach and corpus-based approach, as previous analyzers have not worked for both inflectional and derivational morphemes. The document outlines the structure of Hindi words and sentences to provide context for developing the analyzer.
An implementation of Apertium based Assamese morphological analyzer (ijnlc)
Morphological analysis is an important branch of linguistics for any natural language processing technology. Morphology studies the structure and formation of the words of a language. In the current scenario of NLP research, morphological analysis techniques are becoming more popular day by day. Before processing any language, the morphology of its words should first be analyzed. The Assamese language has a very complex morphological structure. In our work we have used Apertium-based finite-state transducers to develop a morphological analyzer for the Assamese language within a limited domain, obtaining 72.7% accuracy.
In this paper we discuss the difficulties in processing Malayalam texts for Statistical Machine Translation (SMT), especially the verb forms. The agglutinative nature of Malayalam is the main issue in processing its text. We focus mainly on the verbs and their contribution to the difficulty of processing. The verb plays a crucial role in defining the sentence structure. We illustrate the issues with the existing Google translation system and a MOSES system trained on a limited English-Malayalam parallel corpus. Our reference for analysis is the English-Malayalam language pair.
Sanskrit in Natural Language Processing (Hitesh Joshi)
Sanskrit is the most unambiguous language as compared to other natural languages. As stated by Rick Briggs of NASA, it is the most suitable language for the computer in natural language processing.
A Tool to Search and Convert Reduplicate Words from Hindi to Punjabi (IJERA Editor)
This document discusses reduplication in Hindi and proposes developing a tool to identify and translate reduplicated words from Hindi to Punjabi. It begins by defining reduplication and describing different types, including morphological, lexical, echo formation, compounds, word and discontinuous reduplication. It then states the problem of automatically identifying echo words in Hindi conversations to translate to Punjabi. Next, it discusses using rule-based and statistical approaches and choosing rule-based. Finally, it outlines the work to be done, including studying reduplication, past research, translating echo words, and rule-based approaches.
This document describes a rule-based machine translation system for translating English text to Telugu. It discusses the challenges of developing such a system, including differences in grammar between the two languages. An algorithm is proposed that uses rules, probabilities, and rough sets to classify sentences and select the best word translations. The system works by tokenizing English sentences, tagging the words with parts of speech, looking up word translations in a bilingual dictionary, and concatenating the Telugu words to form the output sentence.
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ... (Waqas Tariq)
Words of a sentence will not follow same ordering in different languages. This paper proposes certain Parts-of-Speech (POS) based rules for reordering the given English sentence to get translation in Telugu. The added rules for adverbs, exceptional conjunctions in addition to improved handling of inflections enable the system to achieve more accurate translation. The proposed rules along with existing system gave a score of 0.6190 with BLEU evaluation metric while translating sentences from English to Telugu. This paper deals with simple form of sentences in a better way.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Arabic morphology encapsulates many valuable features such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most natural language processing tasks, especially for derivative languages such as Arabic. However, stemming is faced with the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model: it captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with at least a 9.4% improvement over other stemmers.
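One common formulation of smoothed PMI applies context-distribution smoothing with alpha = 0.75; whether this matches the paper's exact formulation is an assumption, and the co-occurrence counts below are made up.

```python
import math
from collections import Counter

# Smoothed PMI over word-context co-occurrence counts, using context
# distribution smoothing (alpha = 0.75). The tiny count table is made up.
cooc = Counter({("ktb", "book"): 8, ("ktb", "write"): 6, ("drs", "study"): 7,
                ("drs", "book"): 1})

total = sum(cooc.values())
w_cnt = Counter(); c_cnt = Counter()
for (w, c), n in cooc.items():
    w_cnt[w] += n; c_cnt[c] += n

def spmi(w, c, alpha=0.75):
    # smoothed context probability: raise counts to alpha before normalizing
    c_sm = c_cnt[c] ** alpha / sum(v ** alpha for v in c_cnt.values())
    p_wc = cooc[(w, c)] / total
    p_w = w_cnt[w] / total
    return math.log(p_wc / (p_w * c_sm)) if cooc[(w, c)] else float("-inf")

# A root's true context should score higher than a spurious one
print(spmi("ktb", "write") > spmi("drs", "book"))  # True
```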
Parts-of-speech can be divided into closed classes and open classes. Closed classes have a fixed set of members like prepositions, while open classes like nouns and verbs are continually changing with new words being created. Parts-of-speech tagging is the process of assigning a part-of-speech tag to each word using statistical models trained on tagged corpora. Hidden Markov Models are commonly used, where the goal is to find the most probable tag sequence given an input word sequence.
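The most-probable-tag-sequence search described above is typically done with the Viterbi algorithm. A toy sketch with made-up probabilities:

```python
# Viterbi decoding for a tiny HMM POS tagger: find the most probable tag
# sequence for a word sequence. The toy probabilities are made up.

tags = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dogs": 0.5, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.6}}

def viterbi(words):
    # each column maps tag -> (best probability, best path ending in tag)
    v = [{t: (start[t] * emit[t].get(words[0], 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            p, path = max(((v[-1][s][0] * trans[s][t] * emit[t].get(w, 1e-6),
                            v[-1][s][1] + [t]) for s in tags))
            col[t] = (p, path)
        v.append(col)
    return max(v[-1].values())[1]

print(viterbi(["dogs", "bark"]))  # ['N', 'V']
```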
Anaphora resolution is the process of finding referents in discourse. In computational linguistics, anaphora resolution is a complex and challenging task. This paper focuses on pronominal anaphora resolution, a subpart of anaphora resolution in which pronouns are resolved to noun referents. Including anaphora resolution in applications like automatic summarization, opinion mining, machine translation, question answering systems, etc. increases their accuracy by 10%. Related work in this field has been done in many languages. This paper focuses on resolving anaphora for the Punjabi language. A model is proposed for resolving anaphora and an experiment is conducted to measure the accuracy of the system. The model uses two factors: recency and animistic knowledge. The recency factor works on the concept of the Lappin and Leass approach, and the gazetteer method is used to introduce animistic knowledge. The experiment is conducted on a Punjabi story containing more than 1000 words, and results are presented along with future directions.
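A minimal sketch of scoring candidates by recency while filtering on animistic (gazetteer) knowledge; the scoring function and all data below are illustrative assumptions, not the paper's exact model:

```python
# Candidate scoring in the spirit of the recency + animistic-knowledge model:
# more recent referents score higher, and an animate pronoun is only resolved
# to a candidate listed as animate in the gazetteer. Data is hypothetical.

ANIMATE = {"Raj", "Simran"}                     # gazetteer of animate nouns

def resolve(pronoun_animate, candidates):
    # candidates: list of (noun, sentence_distance), most recent = distance 0
    scored = [(1.0 / (1 + dist), noun) for noun, dist in candidates
              if (noun in ANIMATE) == pronoun_animate]
    return max(scored)[1] if scored else None

cands = [("school", 0), ("Raj", 1), ("Simran", 2)]
print(resolve(True, cands))   # Raj
```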
Myanmar word sorting is very important in search-engine indexing to optimize the keyword search process. This paper proposes an efficient sorting algorithm for Myanmar words based on the weights of the consonants, vowels, devowelizers, and consonant combinations of each syllable of the words, since Myanmar words are composed of one or more syllables; finally the words are sorted using Quicksort. The proposed algorithm is designed for the Zawgyi_One font, which is dominant in Myanmar Web pages.
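The weight-based comparison can be sketched by mapping each syllable to a tuple of component weights and ordering words by their weight sequences; Python's built-in sort stands in for the Quicksort the paper uses, and the syllables and weight table are toy stand-ins:

```python
# Sorting words by per-syllable component weights: each syllable maps to a
# tuple of (consonant, vowel, devowelizer) weights, and words compare by
# their syllable-weight sequences. The weight table is a toy stand-in.

WEIGHT = {"ka": (1, 1, 0), "kha": (2, 1, 0), "ki": (1, 2, 0)}

def sort_key(word_syllables):
    return [WEIGHT[s] for s in word_syllables]

words = [["kha", "ka"], ["ka", "ki"], ["ka", "ka"]]
print(sorted(words, key=sort_key))
# [['ka', 'ka'], ['ka', 'ki'], ['kha', 'ka']]
```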
Anaphora resolution in Hindi language using gazetteer method (ijcsa)
Anaphora resolution is one of the active research areas within the realm of natural language processing. Resolution of anaphoric reference is one of the most challenging and complex tasks to be handled. This paper focuses entirely on pronominal anaphora resolution for the Hindi language. There are various methodologies for resolving anaphora. This paper presents a computational model for anaphora resolution in Hindi that is based on the gazetteer method: lists are created and operations are then applied to classify the elements present in the lists. There are many salient factors for resolving anaphora. The proposed model resolves anaphora using two factors, animacy and recency. The animistic factor distinguishes living things from non-living things, whereas recency captures the fact that referents mentioned in the current sentence tend to have higher weights than those in previous sentences. This paper presents the experiments conducted on short Hindi stories, news articles, and biographical content from Wikipedia, along with results and future directions to improve accuracy.
Abstract: The use of regular expressions to search text is a well known and well understood technique. Regular expressions are generic representations for a string or a collection of strings, and regular expressions (regexps) are one of the most useful tools in computer science. NLP, as an area of computer science, has greatly benefitted from regexps: they are used in phonology, morphology, text analysis, information extraction, and speech recognition. This paper gives the reader a general review of the usage of regular expressions, illustrated with examples from natural language processing. In addition, there is a discussion of different approaches to regular expressions in NLP. Keywords— Regular Expression, Natural Language Processing, Tokenization, Longest common subsequence alignment, POS tagging
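A small example of the tokenization use the survey mentions: a single regex alternation that keeps words with apostrophes, decimal numbers, and punctuation as separate tokens (the pattern is illustrative, not one from the paper):

```python
import re

# A small regex tokenizer: one alternation keeps apostrophe words,
# decimal numbers, and punctuation marks as separate tokens.
TOKEN = re.compile(r"[A-Za-z]+(?:'[a-z]+)?|\d+(?:\.\d+)?|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Regexps aren't magic, but they cost 0.5 ms."))
# ['Regexps', "aren't", 'magic', ',', 'but', 'they', 'cost', '0.5', 'ms', '.']
```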
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Statistically-Enhanced New Word Identification (Andi Wu)
This document discusses a method for identifying new words in Chinese text using a combination of rule-based and statistical approaches. Candidate character strings are selected as potential new words based on their independent word probability being below a threshold. Parts of speech are then assigned to candidate strings by examining the part of speech patterns of their component characters and comparing them to existing words in a dictionary to determine the most likely part of speech based on word formation patterns in Chinese. This hybrid approach avoids the overgeneration of rule-based systems and data sparsity issues of purely statistical approaches.
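The independent-word-probability (IWP) filter can be sketched as below; approximating a string's IWP by the maximum IWP of its characters, and the counts themselves, are assumptions for illustration:

```python
# Candidate selection by independent-word probability (IWP): a string is a
# new-word candidate when the probability that its characters occur as
# independent words is below a threshold. Counts are made up.

# For each character: (times seen as a standalone word, total occurrences)
occurrences = {"的": (95, 100), "电": (2, 80), "脑": (1, 60)}

def iwp(string):
    # IWP of a string approximated here by the max IWP of its characters
    return max(ind / total for ind, total in
               (occurrences[ch] for ch in string))

def is_candidate(string, threshold=0.1):
    return iwp(string) < threshold

print(is_candidate("电脑"))  # True: rarely independent, likely a new word
print(is_candidate("的"))    # False: usually a word on its own
```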
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH (ijnlc)
Machine translation has been carried out by researchers for quite a long time. However, a flawless machine translator is still a dream to materialize, and only a small number of researchers have focused on translating Marathi text to English. Perfect machine translation systems have not yet been fully built, owing to the fact that languages differ syntactically as well as morphologically. The majority of researchers have opted for statistical machine translation, whereas in this paper we address the challenges of rule-based machine translation. The paper describes the major divergences observed between Marathi and English, the many challenges encountered while attempting to build a machine translation system from Marathi to English using a rule-based approach, and rules to handle these challenges. As there are exceptions to the rules and limits to the feasibility of maintaining a knowledge base, practical machine translation from Marathi to English is a complex task.
The document discusses text normalization, which involves segmenting and standardizing text for natural language processing. It describes tokenizing text into words and sentences, lemmatizing words into their root forms, and standardizing formats. Tokenization involves separating punctuation, normalizing word formats, and segmenting sentences. Lemmatization determines that words have the same root despite surface differences. Sentence segmentation identifies sentence boundaries, which can be ambiguous without context. Overall, text normalization prepares raw text for further natural language analysis.
Artificially Generated of Concatenative Syllable based Text to Speech Synthesi... (iosrjce)
This document describes a Marathi text-to-speech (TTS) synthesis system based on a concatenative approach using syllables as the basic speech units. The system analyzes input text, performs syllabification based on linguistic rules, retrieves corresponding speech files from a corpus, concatenates the files while minimizing discontinuities at boundaries, and outputs synthesized speech. A Marathi speech corpus was created containing over 1000 sentences from various domains. Subjective quality tests found the synthesized speech to have naturalness and intelligibility comparable to natural speech. The system demonstrates an effective approach for Marathi TTS using a syllable-based concatenative method.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor (Waqas Tariq)
In recent decades speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover the different sound orderings so that the recognizer can build models for all the sounds present. But for the Telugu language, because of its complex nature, a very large, well annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases comprising several words (that is, tokens) in English would be mapped onto a single word in Telugu. The Telugu language is phonetic in nature in addition to being rich in morphology. That is why the speech technology developed for English cannot be applied to the Telugu language. This paper highlights the work carried out in an attempt to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed to suit highly inflecting, morphologically rich languages. This method results in increased speech recognition accuracy with a much smaller corpus. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
The document presents an unsupervised approach for developing stemmers for Urdu and Marathi languages. It uses frequency-based and length-based stripping approaches to automatically learn and generate suffix rules from word corpora without any linguistic knowledge. The frequency-based approach achieved 85.36% accuracy for Urdu and 82.5% for Marathi, outperforming the length-based approach. The stemmer could improve information retrieval for languages by reducing morphological variants to word stems.
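The frequency-based learning step can be sketched as counting word-final n-grams and keeping the frequent ones as suffix rules; the threshold and the toy word list are illustrative, not the paper's settings:

```python
from collections import Counter

# Frequency-based suffix learning in the spirit of the unsupervised stemmer:
# count every word-final n-gram in a corpus and keep the most frequent ones
# as candidate suffix rules. The toy word list stands in for a real corpus.
words = ["walking", "talking", "jumping", "walked", "talked", "cats", "dogs"]

suffix_counts = Counter()
for w in words:
    for n in range(1, 4):                      # candidate suffixes, length 1-3
        if len(w) > n + 2:                     # leave a stem longer than 2
            suffix_counts[w[-n:]] += 1

def stem(word, suffixes):
    for suf in sorted(suffixes, key=len, reverse=True):   # longest first
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

learned = [s for s, c in suffix_counts.items() if c >= 3]  # frequency cutoff
print(stem("walking", learned))  # walk
```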
Word morphology is a process of analysing word formation. Morphological analysis is one of the pre-processing steps in natural language processing tasks. Few studies have looked at Setswana noun morphology analysis and generation computationally. In this paper we present a rule-based Setswana noun morphological analyzer and generator. The analyser and generator implement morphological rules which are supported by a dictionary of root words with some attributes. Results show that Setswana nouns could mostly be analysed using morphological rules and the rules could also be used to generate other words. Adjectives, pronouns, adverbs and enumeratives are also included. The generator shows that Setswana nouns, adjectives and adverbs are less productive compared to verbs. The analyzer gives a 79% performance rate and the generator 92%. The analyser rules fail when multiple words have the same intermediate word and with homographs. The generator failures are due to over generation and under generation.
Quality estimation of machine translation outputs through stemming (ijcsa)
Machine translation is a challenging problem for Indian languages. Every day we see new machine translators being developed, but obtaining a high-quality automatic translation is still a very distant dream. A correctly translated sentence for the Hindi language is rarely found. In this paper we emphasize the English-Hindi language pair, and in order to identify the correct MT output we present a ranking system which employs machine learning techniques and morphological features. No human intervention is required in the ranking. We have also validated our results by comparing them with human rankings.
Improving a Lightweight Stemmer for Gujarati Language (ijistjournal)
Stemming lies at the root of text mining. It is used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM), including Text Categorization (TC) and Text Summarization (TS). Building an effective stemmer for the Gujarati language has always been a hot research domain, since Gujarati has a structure very different from, and more difficult than, other languages due to its rich morphology.
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts (kevig)
Cyberbullying is currently one of the most important research fields. Owing to the scarcity of data, the majority of researchers have contributed to bully-text identification in English texts or comments; analyzing Tamil text stemming is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, and creating a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, we propose the rule-based iterative preprocessing algorithm (RBIPA). In this attempt, Tamil morphemes and lemmas were extracted using the suffix stripping technique, and a supervised machine learning algorithm was used to classify words as pronouns and proper nouns. The novelty of the proposed system is a preprocessing algorithm for iterative stemming and lemmatization that discovers the exact words in Tamil-language comments. RBIPA shows 84.96% accuracy on the given test dataset, which has a total of 13000 words.
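The iterative stripping loop at the heart of such a stemmer can be sketched as below; the suffix list uses romanized stand-ins rather than actual Tamil suffixes:

```python
# Iterative suffix stripping of the kind RBIPA describes: keep removing
# matching suffixes until none apply or the stem would become too short.
# The suffix list uses romanized stand-ins, not actual Tamil suffixes.

SUFFIXES = ["kal", "ai", "il", "in"]

def iterative_stem(word, min_len=3):
    changed = True
    while changed:
        changed = False
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) - len(suf) >= min_len:
                word = word[: -len(suf)]   # strip one layer and loop again
                changed = True
                break
    return word

print(iterative_stem("maramkalil"))  # maram
```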
A Tool to Search and Convert Reduplicate Words from Hindi to PunjabiIJERA Editor
This document discusses reduplication in Hindi and proposes developing a tool to identify and translate reduplicated words from Hindi to Punjabi. It begins by defining reduplication and describing different types, including morphological, lexical, echo formation, compounds, word and discontinuous reduplication. It then states the problem of automatically identifying echo words in Hindi conversations to translate to Punjabi. Next, it discusses using rule-based and statistical approaches and choosing rule-based. Finally, it outlines the work to be done, including studying reduplication, past research, translating echo words, and rule-based approaches.
This document describes a rule-based machine translation system for translating English text to Telugu. It discusses the challenges of developing such a system, including differences in grammar between the two languages. An algorithm is proposed that uses rules, probabilities, and rough sets to classify sentences and select the best word translations. The system works by tokenizing English sentences, tagging the words with parts of speech, looking up word translations in a bilingual dictionary, and concatenating the Telugu words to form the output sentence.
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...Waqas Tariq
Words of a sentence will not follow same ordering in different languages. This paper proposes certain Parts-of-Speech (POS) based rules for reordering the given English sentence to get translation in Telugu. The added rules for adverbs, exceptional conjunctions in addition to improved handling of inflections enable the system to achieve more accurate translation. The proposed rules along with existing system gave a score of 0.6190 with BLEU evaluation metric while translating sentences from English to Telugu. This paper deals with simple form of sentences in a better way.
Arabic morphology encapsulates many valuable features such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivational languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots could be extracted from the same word. Distributional semantics, on the other hand, is a powerful co-occurrence model that captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
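The SPMI scoring can be sketched over a toy word-context co-occurrence table; the counts and the smoothing exponent below are illustrative values, not data from the paper:

```python
import math
from collections import Counter

# Toy word-context co-occurrence counts (illustrative, not the paper's data).
cooc = Counter({("ktb", "book"): 8, ("ktb", "pen"): 1, ("drs", "book"): 2})
alpha = 0.75  # context-distribution smoothing exponent

total = sum(cooc.values())
w_count, c_count = Counter(), Counter()
for (w, c), n in cooc.items():
    w_count[w] += n
    c_count[c] += n

# Smoothing raises context counts to alpha before normalizing.
c_smoothed_total = sum(n ** alpha for n in c_count.values())

def spmi(w, c):
    p_wc = cooc[(w, c)] / total
    p_w = w_count[w] / total
    p_c = (c_count[c] ** alpha) / c_smoothed_total
    return math.log2(p_wc / (p_w * p_c))
```

The smoothing exponent dampens the bias PMI otherwise has toward rare contexts.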
Parts-of-speech can be divided into closed classes and open classes. Closed classes have a fixed set of members like prepositions, while open classes like nouns and verbs are continually changing with new words being created. Parts-of-speech tagging is the process of assigning a part-of-speech tag to each word using statistical models trained on tagged corpora. Hidden Markov Models are commonly used, where the goal is to find the most probable tag sequence given an input word sequence.
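A minimal Viterbi decoder illustrates how the most probable tag sequence is found under an HMM; the two-tag model and probabilities below are toy values, not a trained model:

```python
import math

# Toy HMM: two tags, illustrative start/transition/emission probabilities.
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "dogs"): 0.6, ("V", "dogs"): 0.1,
        ("N", "run"): 0.1, ("V", "run"): 0.6}

def viterbi(words):
    # Column of log-probabilities for the first word.
    V = [{t: math.log(start[t]) + math.log(emit.get((t, words[0]), 1e-9))
          for t in tags}]
    back = []
    for w in words[1:]:
        col, bp = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p] + math.log(trans[(p, t)]))
            col[t] = (V[-1][best_prev] + math.log(trans[(best_prev, t)])
                      + math.log(emit.get((t, w), 1e-9)))
            bp[t] = best_prev
        V.append(col)
        back.append(bp)
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: V[-1][t])
    seq = [last]
    for bp in reversed(back):
        seq.append(bp[seq[-1]])
    return list(reversed(seq))
```

In practice the probabilities are estimated from a tagged corpus rather than set by hand.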
Anaphora resolution is the process of finding referents in discourse. In computational linguistics, anaphora resolution is a complex and challenging task. This paper focuses on pronominal anaphora resolution, a subpart of anaphora resolution in which pronouns are resolved to noun referents. Incorporating anaphora resolution into applications such as automatic summarization, opinion mining, machine translation, and question answering systems can increase their accuracy by 10%. Related work in this field has been done in many languages; this paper focuses on resolving anaphora for the Punjabi language. A model is proposed for resolving anaphora, and an experiment is conducted to measure the accuracy of the system. The model uses two factors: recency and animistic knowledge. The recency factor follows the concept of the Lappin-Leass approach, and a gazetteer method is used to introduce animistic knowledge. The experiment is conducted on a Punjabi story containing more than 1000 words, and results are presented along with future directions.
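The recency-plus-animacy scoring idea can be sketched as follows; the decay rate, bonus weight, and gazetteer entries are illustrative assumptions, not the paper's tuned values:

```python
# Sketch: score candidate antecedents by recency (Lappin-Leass-style decay)
# plus an animacy-agreement bonus from a gazetteer. All values are illustrative.
ANIMATE = {"kaka", "sher"}   # hypothetical gazetteer entries (romanized)

def best_antecedent(candidates, pronoun_is_animate=True):
    # candidates: list of (word, sentence_distance); 0 = same sentence
    best, best_score = None, float("-inf")
    for word, dist in candidates:
        score = 100 * (0.6 ** dist)        # salience decays per intervening sentence
        if (word in ANIMATE) == pronoun_is_animate:
            score += 50                    # animacy agreement bonus
        if score > best_score:
            best, best_score = word, score
    return best
```

The animacy bonus lets a slightly older animate candidate beat a more recent inanimate one when the pronoun requires an animate referent.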
Myanmar word sorting is very important in search-engine indexing to optimize the keyword search process. This paper proposes an efficient sorting algorithm for Myanmar words based on the weights of the consonants, vowels, devowelizers, and consonant combinations of each syllable of the words, since Myanmar words are composed of one or more syllables; the words are finally ordered using Quicksort. The proposed algorithm is designed for the Zawgyi_One font, which is dominant in Myanmar Web pages.
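The weight-based ordering can be sketched with a sort key built from per-syllable weights; the weight table and toy longest-match scanner below are stand-ins for the consonant/vowel/devowelizer weights the paper defines for Zawgyi_One, and Python's built-in sort is used in place of Quicksort:

```python
# Hypothetical weight table over romanized units (not actual Zawgyi_One weights).
WEIGHTS = {"k": 1, "kh": 2, "g": 3, "a": 1, "i": 2, "u": 3}

def syllable_key(word):
    # Toy longest-match scan: try a digraph first, then a single character.
    key, i = [], 0
    while i < len(word):
        if word[i:i + 2] in WEIGHTS:
            key.append(WEIGHTS[word[i:i + 2]])
            i += 2
        else:
            key.append(WEIGHTS.get(word[i], 0))
            i += 1
    return key

ordered = sorted(["ga", "ka", "kha"], key=syllable_key)
```

Comparing lists of weights lexicographically is what gives a linguistically meaningful order that plain codepoint comparison cannot.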
Anaphora resolution in hindi language using gazetteer method ijcsa
Anaphora resolution is one of the active research areas within the realm of natural language processing. Resolution of anaphoric reference is one of the most challenging and complex tasks to be handled. This paper focuses entirely on pronominal anaphora resolution for the Hindi language. Among the various methodologies for resolving anaphora, this paper presents a computational model for anaphora resolution in Hindi based on the gazetteer method, which creates lists and then applies operations to classify the elements present in them. Of the many salient factors for resolving anaphora, the proposed model uses two: animism and recency. The animistic factor distinguishes living things from non-living things, whereas recency captures the tendency of referents mentioned in the current sentence to receive higher weights than those in previous sentences. The paper describes experiments conducted on short Hindi stories, news articles, and biography content from Wikipedia, along with results and future directions for improving accuracy.
Abstract: The use of regular expressions to search text is a well-known and well-understood technique. Regular expressions are generic representations for a string or a collection of strings, and regular expressions (regexps) are among the most useful tools in computer science. NLP, as an area of computer science, has greatly benefited from regexps: they are used in phonology, morphology, text analysis, information extraction, and speech recognition. This paper gives the reader a general review of the usage of regular expressions, illustrated with examples from natural language processing. In addition, there is a discussion of different approaches to regular expressions in NLP. Keywords: Regular Expression, Natural Language Processing, Tokenization, Longest common subsequence alignment, POS tagging
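As a concrete illustration of regexps in NLP, a single pattern with ordered alternatives can tokenize words, contractions, currency amounts, and punctuation:

```python
import re

text = "Dr. Smith paid $3.50, didn't he?"
# Alternatives are tried left to right: currency/number first,
# then a word with an optional internal apostrophe, then lone punctuation.
tokens = re.findall(r"\$?\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]", text)
```

Note how the ordering keeps "$3.50" and "didn't" as single tokens while still splitting off "," and "?".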
Statistically-Enhanced New Word Identification Andi Wu
This document discusses a method for identifying new words in Chinese text using a combination of rule-based and statistical approaches. Candidate character strings are selected as potential new words based on their independent word probability being below a threshold. Parts of speech are then assigned to candidate strings by examining the part of speech patterns of their component characters and comparing them to existing words in a dictionary to determine the most likely part of speech based on word formation patterns in Chinese. This hybrid approach avoids the overgeneration of rule-based systems and data sparsity issues of purely statistical approaches.
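The candidate-selection step can be sketched as a threshold test on the average independent word probability (IWP) of the component characters; the character counts below are toy values, not corpus statistics:

```python
# Toy statistics: how often each character occurs at all, and how often
# it occurs as an independent word (illustrative values).
occurrences = {"xin": 50, "ci": 40}
as_word = {"xin": 5, "ci": 38}

def iwp(ch):
    # independent word probability of a character
    return as_word[ch] / occurrences[ch]

def is_new_word_candidate(chars, threshold=0.4):
    # A string whose characters rarely stand alone as words is a likely
    # new-word candidate; the threshold here is an assumed value.
    return sum(iwp(c) for c in chars) / len(chars) < threshold
```

A character that almost always appears inside larger words scores a low IWP, flagging strings built from it for the part-of-speech assignment stage.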
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH ijnlc
Machine translation research has been carried out for quite a long time. However, a flawless machine translator is still a dream to materialize, and only a small number of researchers have focused on translating Marathi text to English. Perfect machine translation systems have not yet been fully built, owing to the fact that languages differ syntactically as well as morphologically. The majority of researchers have opted for statistical machine translation, whereas in this paper we address the challenges of rule-based machine translation. The paper describes the major divergences observed between Marathi and English, the many challenges encountered while attempting to build a machine translation system from Marathi to English using a rule-based approach, and the rules to handle these challenges. As there are exceptions to the rules and a limit to the feasibility of maintaining the knowledge base, practical machine translation from Marathi to English is a complex task.
The document discusses text normalization, which involves segmenting and standardizing text for natural language processing. It describes tokenizing text into words and sentences, lemmatizing words into their root forms, and standardizing formats. Tokenization involves separating punctuation, normalizing word formats, and segmenting sentences. Lemmatization determines that words have the same root despite surface differences. Sentence segmentation identifies sentence boundaries, which can be ambiguous without context. Overall, text normalization prepares raw text for further natural language analysis.
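The sentence-segmentation ambiguity mentioned above (a period may end a sentence or merely an abbreviation) can be handled with a simple abbreviation guard; the abbreviation list here is illustrative:

```python
# Minimal sentence segmenter: a period ends a sentence only when the
# token carrying it is not a known abbreviation. ABBREV is illustrative.
ABBREV = {"Dr.", "Mr.", "U.S."}

def segment_sentences(text):
    sentences, buf = [], []
    for tok in text.split():
        buf.append(tok)
        if tok.endswith((".", "!", "?")) and tok not in ABBREV:
            sentences.append(" ".join(buf))
            buf = []
    if buf:
        sentences.append(" ".join(buf))
    return sentences
```

Real systems extend the guard with machine-learned classifiers, but the lookup captures why sentence boundaries are ambiguous without context.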
Artificially Generated Concatenative Syllable based Text to Speech Synthesi... iosrjce
This document describes a Marathi text-to-speech (TTS) synthesis system based on a concatenative approach using syllables as the basic speech units. The system analyzes input text, performs syllabification based on linguistic rules, retrieves corresponding speech files from a corpus, concatenates the files while minimizing discontinuities at boundaries, and outputs synthesized speech. A Marathi speech corpus was created containing over 1000 sentences from various domains. Subjective quality tests found the synthesized speech to have naturalness and intelligibility comparable to natural speech. The system demonstrates an effective approach for Marathi TTS using a syllable-based concatenative method.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor Waqas Tariq
In recent decades speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach, which requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover sounds in varied order so that the recognizer can build models for all the sounds present. But for Telugu, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu; phrases comprising several words (that is, tokens) in English map onto a single word in Telugu. Telugu is phonetic in nature in addition to being rich in morphology, which is why speech technology developed for English cannot be applied directly to Telugu. This paper highlights work carried out in an attempt to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed to suit highly inflecting, morphologically rich languages. This method increases speech recognition accuracy with a substantial reduction in corpus size, and it adapts Telugu words to the database dynamically, resulting in growth of the corpus.
The document presents an unsupervised approach for developing stemmers for Urdu and Marathi languages. It uses frequency-based and length-based stripping approaches to automatically learn and generate suffix rules from word corpora without any linguistic knowledge. The frequency-based approach achieved 85.36% accuracy for Urdu and 82.5% for Marathi, outperforming the length-based approach. The stemmer could improve information retrieval for languages by reducing morphological variants to word stems.
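The frequency-based approach can be sketched by counting word endings across a corpus and keeping the most frequent ones as candidate suffix-stripping rules; the toy corpus below is romanized and purely illustrative:

```python
from collections import Counter

# Toy romanized corpus (illustrative, not the paper's Urdu/Marathi data).
words = ["khelna", "chalna", "padhna", "kitab"]

def learn_suffixes(words, max_len=3, top_k=2):
    counts = Counter()
    for w in words:
        for k in range(1, max_len + 1):
            if len(w) > k + 2:          # keep a minimal stem behind the suffix
                counts[w[-k:]] += 1
    # The most frequent endings become candidate stripping rules.
    return [s for s, _ in counts.most_common(top_k)]
```

No linguistic knowledge is needed: frequent endings across many distinct words are statistically likely to be genuine suffixes.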
Word morphology is a process of analysing word formation. Morphological analysis is one of the pre-processing steps in natural language processing tasks. Few studies have looked at Setswana noun morphology analysis and generation computationally. In this paper we present a rule-based Setswana noun morphological analyzer and generator. The analyser and generator implement morphological rules which are supported by a dictionary of root words with some attributes. Results show that Setswana nouns could mostly be analysed using morphological rules and the rules could also be used to generate other words. Adjectives, pronouns, adverbs and enumeratives are also included. The generator shows that Setswana nouns, adjectives and adverbs are less productive compared to verbs. The analyzer gives a 79% performance rate and the generator 92%. The analyser rules fail when multiple words have the same intermediate word and with homographs. The generator failures are due to over generation and under generation.
Quality estimation of machine translation outputs through stemming ijcsa
Machine translation is a challenging problem for Indian languages. Every day new machine translators are being developed, but obtaining a high-quality automatic translation is still a very distant dream; a correctly translated sentence in Hindi is rarely produced. In this paper we focus on the English-Hindi language pair, and in order to identify the correct MT output we present a ranking system that employs machine learning techniques and morphological features. No human intervention is required in ranking. We have also validated our results by comparing them with human ranking.
Improving a Lightweight Stemmer for Gujarati Language ijistjournal
Stemming lies at the root of text mining. It is used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR), and Text Mining (TM), including Text Categorization (TC) and Text Summarization (TS). Building an effective stemmer for Gujarati has always been an active research area, since Gujarati has a structure very different from, and more difficult than, other languages due to its rich morphology.
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts kevig
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S... ijnlc
The document proposes using part-of-speech tagging and stemming to improve Gujarati to Hindi machine translation through transliteration. It presents a system that applies stemming and POS tagging to Gujarati text before transliterating to resolve ambiguities. An evaluation of the system on 500 sentences found that transliteration and translation matched for 54.48% of Gujarati words, and overall transliteration efficiency was 93.09%. The approach aims to improve over direct transliteration for the highly inflected Gujarati language.
ISOLATING WORD LEVEL RULES IN TAMIL LANGUAGE FOR EFFICIENT DEVELOPMENT OF LAN... kevig
With the advent of social media, the amount of text available for processing across different natural languages has become enormous. In the past few decades, there has been a tremendous increase in the number of language processing applications. The tools for natural language computing vary greatly across languages because each language has its own set of grammatical rules. This paper focuses on identifying the basic inflectional principles of the Tamil language at the word level. Three levels of word inflection concepts are considered: patterns, rules, and exceptions. How grammatical principles for word inflections in Tamil can be grouped into these three levels and applied to obtain different word forms is the focus of this paper. These can be used in a wide variety of natural language applications like morphological analysis, morphological generation, word-level translation, spelling and grammar checking, information extraction, etc. Tools using these rules will allow faster operation and better implementation of Tamil grammatical rules referred from [த ொல்த ொப்பியம் | tholgaappiyam] and [ நன்னூல் | nannool] in NLP applications.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMER kevig
Morphological stemming is a critical step in natural language processing: the process of stemming reduces alternative forms to a common morphological root. Word segmentation for the Myanmar language, as for most Asian languages, is an important and extensively studied sequence labelling problem. Named entity detection is another issue in Asian languages that has traditionally required a large amount of feature engineering to achieve high performance. The new approach integrates these tasks in a way that benefits all of them. In recent years, end-to-end sequence labelling models based on deep learning have become widely used. This paper introduces a deep BiGRU-CNN-CRF network that jointly learns the word segmentation, stemming, and named entity recognition tasks. We trained the model using manually annotated corpora. State-of-the-art named entity recognition systems rely heavily on handcrafted features; in our new approach, the joint model instead relies on two sources of information: character-level representation and syllable-level representation.
This document discusses various methods for teaching grammar:
- The deductive method involves explaining grammar rules first before providing examples.
- The inductive method presents examples first to allow students to observe patterns and derive rules themselves.
- The informal method focuses on correcting usage rather than explicit instruction of rules.
- The incidental method teaches grammar implicitly through activities like translating texts or compositions.
1) This document discusses stemming algorithms that have been used for the Odia language. Stemming is the process of reducing inflected words to their root or stem for purposes like information retrieval.
2) It reviews different stemming algorithms that have been applied to Odia text, including suffix stripping, affix removal, and stochastic algorithms. It also discusses common errors in stemming like over-stemming and under-stemming.
3) Applications of stemming discussed include information retrieval, text summarization, machine translation, indexing, and question answering systems. The document concludes by surveying prior work on stemming algorithms for Odia.
The document discusses various linguistic concepts related to grammar. It begins by defining grammar in both the broad and narrow sense. In the broad sense, grammar includes all aspects of language like morphology, syntax, semantics etc. In the narrow sense, grammar refers specifically to word formation and sentence structure.
It then contrasts prescriptive grammar, which provides normative rules, with descriptive grammar, which objectively describes language as used. There is disagreement between these approaches on sentences like "I don't know nothing".
The document also discusses key grammatical units like phrases, clauses, and morphemes. It explains differences between concepts like stems, roots, affixes and allomorphs. Finally, it outlines different types
Designing A Rule Based Stemming Algorithm for Kambaata Language Text CSCJournals
Stemming is the process of reducing inflectional and derivational variants of a word to its stem, and it has substantial importance in several natural language processing applications. In this research, a rule-based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is single-pass, context-sensitive, and longest-matching, designed by adapting the rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology, with word formation mostly relying on suffixation, even though it also involves infixation, compounding, and reduplication.
The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, the error-counting method was applied on a test set of 2425 distinct words. Out of 2425 words, 2349 (96.87%) were stemmed correctly, 63 (2.60%) were over-stemmed, and 13 (0.54%) were under-stemmed. Moreover, a dictionary reduction of 65.86% was achieved during evaluation.
The main source of errors in stemming Kambaata words is the language's rich and complex morphology. A number of errors can be corrected by exploring more rules; however, it is difficult to avoid errors completely, owing to a complex morphology that makes use of concatenated suffixes, irregularities through infixation, compounding, blending, and reduplication of affixes.
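The single-pass, longest-match strategy can be sketched as follows; the suffix inventory is hypothetical, not the paper's Kambaata rules:

```python
# Single-pass, longest-match suffix stripping in the spirit of the Kambaata
# stemmer. SUFFIXES and MIN_STEM are illustrative assumptions.
SUFFIXES = sorted(["ta", "nta", "anta"], key=len, reverse=True)
MIN_STEM = 3   # context guard: never strip below a minimal stem length

def stem(word):
    for suf in SUFFIXES:               # longest candidate first, strip at most once
        if word.endswith(suf) and len(word) - len(suf) >= MIN_STEM:
            return word[:-len(suf)]
    return word
```

Unlike an iterative stemmer, a single pass strips one suffix at most, trading recall on concatenated suffixes for protection against over-stemming.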
Jayakumar sentence pattern method-a new approach for teaching spoken english ...Jayakumar K S
This article discusses a new Sentence Pattern Method (SPM) for teaching spoken English to EFL learners. The SPM helps learners convert thoughts in their mother tongue to English sentences easily by using templates with variables. The method was tested successfully with Tamil students, improving their fluency, interest, and reducing nervousness. Key aspects included using simple sentences, focusing on present tense, and creating patterns for common phrases in students' mother tongue. Research findings showed students benefited from understanding fluency problems and applying the SPM approach.
The objective of this research is to classify serial-verb constructions in Thai automatically, using the word classes from Thai WordNet to classify the verbs in a sentence. The Thai language has an extend-to-the-right structure and places the adjective after the noun; its overall grammar is of the "Subject-Verb-Object" (SVO) type. Thai can also chain one verb after another within the same sentence, which is called a "serial verb". There is already considerable research on serial-verb constructions, but none on their automatic classification.
This document outlines how a concordancer can be used in language teaching and learning. It defines a concordancer as software that allows users to search language corpora to analyze word usage. The document discusses how concordancers can help students learn collocations, understand different word meanings and usages, find authentic examples, and learn from their own errors. Some potential issues are that corpora may be too advanced for lower levels and interfaces can be complex. Overall, the document argues that while concordancers present initial challenges, they can become very useful tools when incorporated into language instruction.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
Stemming is the process of term conflation: it conflates all word variants to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications like morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, information retrieval (IR), etc. Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all these applications. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and the existing stemmers for Indic languages.
Similar to DESIGN OF A RULE BASED HINDI LEMMATIZER (20)
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR cscpconf
The progressive development of Synthetic Aperture Radar (SAR) systems has diversified the exploitation of the images generated by these systems in different applications of geoscience. The detection and monitoring of surface deformations produced by various phenomena have benefited from this evolution and have been realized by interferometry (InSAR) and differential interferometry (DInSAR) techniques. Nevertheless, spatial and temporal decorrelation of the interferometric couples used strongly limits the precision of the analysis results of these techniques. In this context, we propose a methodological approach for surface deformation detection and analysis by differential interferograms, to show the limits of this technique according to noise quality and level. The detectability model is generated from the deformation signatures by simulating a linear fault merged into the image couples of the ERS1/ERS2 sensors acquired over a region of the Algerian south.
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFICATION cscpconf
A novel trajectory-guided, concatenative approach for synthesizing high-quality video from real image samples is proposed. The automated lip-reading system searches a library of video data for the real image sample sequence closest to the HMM-predicted trajectory. The object trajectory is estimated by projecting the face patterns into a KDA feature space. The approach identifies a speaker's face by synthesizing the identity surface of a subject's face from a small sample of patterns that sparsely cover the view sphere. A KDA algorithm is used to discriminate the lip-reading images, after which the fundamental lip feature vector is reduced to a low dimensionality using the 2D-DCT. The dimensionality of the mouth-area set is further reduced with PCA to obtain the Eigen-lips approach proposed by [33]. The subjective performance results of the cost function under the automatic lip-reading model did not demonstrate the superior performance of the method.
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE... cscpconf
Universities offer a software engineering capstone course to simulate a real-world working environment in which students work in a team for a fixed period to deliver a quality product. The objective of this paper is to report on our experience in moving from the Waterfall process to an Agile process in conducting the software engineering capstone project. We present the capstone course designs for both Waterfall-driven and Agile-driven methodologies, highlighting the structure, deliverables, and assessment plans. To evaluate the improvement, we surveyed two different sections taught by two different instructors about students' experience in moving from the traditional Waterfall model to an Agile-like process. Twenty-eight students filled in the survey, which consisted of eight multiple-choice questions and an open-ended question to collect feedback. The results show that students were able to gain hands-on experience in a setting that simulates a real-world working environment, and that the Agile approach helped students achieve an overall better design and avoid mistakes made in the initial design completed in the first phase of the project. In addition, they were able to assess their team's capabilities and training needs, and thus learn the required technologies earlier, which was reflected in the final product quality.
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES cscpconf
This document discusses using social media technologies to promote student engagement in a software project management course. It describes the course and objectives of enhancing communication. It discusses using Facebook for 4 years, then switching to WhatsApp based on student feedback, and finally introducing Slack to enable personalized team communication. Surveys found students engaged and satisfied with all three tools, though less familiar with Slack. The conclusion is that social media promotes engagement but familiarity with the tool also impacts satisfaction.
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC cscpconf
Using a computer to answer questions has been a human dream since the beginning of the digital era. Question-answering systems are referred to as intelligent systems that provide responses to a user's questions based on facts or rules stored in a knowledge base, and they can generate answers to questions asked in natural language. One of the first main ideas of fuzzy logic was to work on the problem of computer understanding of natural language. This survey paper provides an overview of what question answering is, its system architecture, and its possible relationship with fuzzy logic, as well as previous related research with respect to the approaches that were followed. At the end, the survey provides an analytical discussion of the proposed QA models, alone or combined with fuzzy logic, and their main contributions and limitations.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
Human beings generate different speech waveforms while speaking the same word at different times. Also, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words which facilitate preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming technique for global alignment and shortest distance measurements. The DPW algorithm can be used to enhance the pronunciation dictionaries of the well-known languages like English or to build pronunciation dictionaries to the less known sparse languages. The precision measurement experiments show 88.9% accuracy.
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS cscpconf
In education, the use of electronic (E) examination systems is not a novel idea, as Eexamination systems have been used to conduct objective assessments for the last few years. This research deals with randomly designed E-examinations and proposes an E-assessment system that can be used for subjective questions. This system assesses answers to subjective questions by finding a matching ratio for the keywords in instructor and student answers. The matching ratio is achieved based on semantic and document similarity. The assessment system is composed of four modules: preprocessing, keyword expansion, matching, and grading. A survey and case study were used in the research design to validate the proposed system. The examination assessment system will help instructors to save time, costs, and resources, while increasing efficiency and improving the productivity of exam setting and assessments.
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICcscpconf
African Buffalo Optimization (ABO) is one of the most recent swarms intelligence based metaheuristics. ABO algorithm is inspired by the buffalo’s behavior and lifestyle. Unfortunately, the standard ABO algorithm is proposed only for continuous optimization problems. In this paper, the authors propose two discrete binary ABO algorithms to deal with binary optimization problems. In the first version (called SBABO) they use the sigmoid function and probability model to generate binary solutions. In the second version (called LBABO) they use some logical operator to operate the binary solutions. Computational results on two knapsack problems (KP and MKP) instances show the effectiveness of the proposed algorithm and their ability to achieve good and promising solutions.
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINcscpconf
In recent years, many malware writers have relied on Dynamic Domain Name Services (DDNS) to maintain their Command and Control (C&C) network infrastructure to ensure a persistence presence on a compromised host. Amongst the various DDNS techniques, Domain Generation Algorithm (DGA) is often perceived as the most difficult to detect using traditional methods. This paper presents an approach for detecting DGA using frequency analysis of the character distribution and the weighted scores of the domain names. The approach’s feasibility is demonstrated using a range of legitimate domains and a number of malicious algorithmicallygenerated domain names. Findings from this study show that domain names made up of English characters “a-z” achieving a weighted score of < 45 are often associated with DGA. When a weighted score of < 45 is applied to the Alexa one million list of domain names, only 15% of the domain names were treated as non-human generated.
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...cscpconf
The document proposes a blockchain-based digital currency and streaming platform called GoMAA to address issues of piracy in the online music streaming industry. Key points:
- GoMAA would use a digital token on the iMediaStreams blockchain to enable secure dissemination and tracking of streamed content. Content owners could control access and track consumption of released content.
- Original media files would be converted to a Secure Portable Streaming (SPS) format, embedding watermarks and smart contract data to indicate ownership and enable validation on the blockchain.
- A browser plugin would provide wallets for fans to collect GoMAA tokens as rewards for consuming content, incentivizing participation and addressing royalty discrepancies by recording
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMcscpconf
This document discusses the importance of verb suffix mapping in discourse translation from English to Telugu. It explains that after anaphora resolution, the verbs must be changed to agree with the gender, number, and person features of the subject or anaphoric pronoun. Verbs in Telugu inflect based on these features, while verbs in English only inflect based on number and person. Several examples are provided that demonstrate how the Telugu verb changes based on whether the subject or pronoun is masculine, feminine, neuter, singular or plural. Proper verb suffix mapping is essential for generating natural and coherent translations while preserving the context and meaning of the original discourse.
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...cscpconf
In this paper, based on the definition of conformable fractional derivative, the functional
variable method (FVM) is proposed to seek the exact traveling wave solutions of two higherdimensional
space-time fractional KdV-type equations in mathematical physics, namely the
(3+1)-dimensional space–time fractional Zakharov-Kuznetsov (ZK) equation and the (2+1)-
dimensional space–time fractional Generalized Zakharov-Kuznetsov-Benjamin-Bona-Mahony
(GZK-BBM) equation. Some new solutions are procured and depicted. These solutions, which
contain kink-shaped, singular kink, bell-shaped soliton, singular soliton and periodic wave
solutions, have many potential applications in mathematical physics and engineering. The
simplicity and reliability of the proposed method is verified.
AUTOMATED PENETRATION TESTING: AN OVERVIEWcscpconf
The document discusses automated penetration testing and provides an overview. It compares manual and automated penetration testing, noting that automated testing allows for faster, more standardized and repeatable tests but has limitations in developing new exploits. It also reviews some current automated penetration testing methodologies and tools, including those using HTTP/TCP/IP attacks, linking common scanning tools, a Python-based tool targeting databases, and one using POMDPs for multi-step penetration test planning under uncertainty. The document concludes that automated testing is more efficient than manual for known vulnerabilities but cannot replace manual testing for discovering new exploits.
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKcscpconf
Since the mid of 1990s, functional connectivity study using fMRI (fcMRI) has drawn increasing
attention of neuroscientists and computer scientists, since it opens a new window to explore
functional network of human brain with relatively high resolution. BOLD technique provides
almost accurate state of brain. Past researches prove that neuro diseases damage the brain
network interaction, protein- protein interaction and gene-gene interaction. A number of
neurological research paper also analyse the relationship among damaged part. By
computational method especially machine learning technique we can show such classifications.
In this paper we used OASIS fMRI dataset affected with Alzheimer’s disease and normal
patient’s dataset. After proper processing the fMRI data we use the processed data to form
classifier models using SVM (Support Vector Machine), KNN (K- nearest neighbour) & Naïve
Bayes. We also compare the accuracy of our proposed method with existing methods. In future,
we will other combinations of methods for better accuracy.
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...cscpconf
The document proposes a new validation method for fuzzy association rules based on three steps: (1) applying the EFAR-PN algorithm to extract a generic base of non-redundant fuzzy association rules using fuzzy formal concept analysis, (2) categorizing the extracted rules into groups, and (3) evaluating the relevance of the rules using structural equation modeling, specifically partial least squares. The method aims to address issues with existing fuzzy association rule extraction algorithms such as large numbers of extracted rules, redundancy, and difficulties with manual validation.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
In many applications of data mining, class imbalance is noticed when examples in one class are
overrepresented. Traditional classifiers result in poor accuracy of the minority class due to the
class imbalance. Further, the presence of within class imbalance where classes are composed of
multiple sub-concepts with different number of examples also affect the performance of
classifier. In this paper, we propose an oversampling technique that handles between class and
within class imbalance simultaneously and also takes into consideration the generalization
ability in data space. The proposed method is based on two steps- performing Model Based
Clustering with respect to classes to identify the sub-concepts; and then computing the
separating hyperplane based on equal posterior probability between the classes. The proposed
method is tested on 10 publicly available data sets and the result shows that the proposed
method is statistically superior to other existing oversampling methods.
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHcscpconf
Data collection is an essential, but manpower intensive procedure in ecological research. An
algorithm was developed by the author which incorporated two important computer vision
techniques to automate data cataloging for butterfly measurements. Optical Character
Recognition is used for character recognition and Contour Detection is used for imageprocessing.
Proper pre-processing is first done on the images to improve accuracy. Although
there are limitations to Tesseract’s detection of certain fonts, overall, it can successfully identify
words of basic fonts. Contour detection is an advanced technique that can be utilized to
measure an image. Shapes and mathematical calculations are crucial in determining the precise
location of the points on which to draw the body and forewing lines of the butterfly. Overall,
92% accuracy were achieved by the program for the set of butterflies measured.
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...cscpconf
Smart cities utilize Internet of Things (IoT) devices and sensors to enhance the quality of the city
services including energy, transportation, health, and much more. They generate massive
volumes of structured and unstructured data on a daily basis. Also, social networks, such as
Twitter, Facebook, and Google+, are becoming a new source of real-time information in smart
cities. Social network users are acting as social sensors. These datasets so large and complex
are difficult to manage with conventional data management tools and methods. To become
valuable, this massive amount of data, known as 'big data,' needs to be processed and
comprehended to hold the promise of supporting a broad range of urban and smart cities
functions, including among others transportation, water, and energy consumption, pollution
surveillance, and smart city governance. In this work, we investigate how social media analytics
help to analyze smart city data collected from various social media sources, such as Twitter and
Facebook, to detect various events taking place in a smart city and identify the importance of
events and concerns of citizens regarding some events. A case scenario analyses the opinions of
users concerning the traffic in three largest cities in the UAE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGEcscpconf
The anonymity of social networks makes it attractive for hate speech to mask their criminal
activities online posing a challenge to the world and in particular Ethiopia. With this everincreasing
volume of social media data, hate speech identification becomes a challenge in
aggravating conflict between citizens of nations. The high rate of production, has become
difficult to collect, store and analyze such big data using traditional detection methods. This
paper proposed the application of apache spark in hate speech detection to reduce the
challenges. Authors developed an apache spark based model to classify Amharic Facebook
posts and comments into hate and not hate. Authors employed Random forest and Naïve Bayes
for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold crossvalidation,
the model based on word2vec embedding performed best with 79.83%accuracy. The
proposed method achieve a promising result with unique feature of spark for big data.
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
This article presents Part of Speech tagging for Nepali text using General Regression Neural
Network (GRNN). The corpus is divided into two parts viz. training and testing. The network is
trained and validated on both training and testing data. It is observed that 96.13% words are
correctly being tagged on training set whereas 74.38% words are tagged correctly on testing
data set using GRNN. The result is compared with the traditional Viterbi algorithm based on
Hidden Markov Model. Viterbi algorithm yields 97.2% and 40% classification accuracies on
training and testing data sets respectively. GRNN based POS Tagger is more consistent than the
traditional Viterbi decoding technique.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
Main Java[All of the Base Concepts}.docxadhitya5119
This is part 1 of my Java Learning Journey. This Contains Custom methods, classes, constructors, packages, multithreading , try- catch block, finally block and more.
The simplified electron and muon model, Oscillating Spacetime: The Foundation...RitikBhardwaj56
Discover the Simplified Electron and Muon Model: A New Wave-Based Approach to Understanding Particles delves into a groundbreaking theory that presents electrons and muons as rotating soliton waves within oscillating spacetime. Geared towards students, researchers, and science buffs, this book breaks down complex ideas into simple explanations. It covers topics such as electron waves, temporal dynamics, and the implications of this model on particle physics. With clear illustrations and easy-to-follow explanations, readers will gain a new outlook on the universe's fundamental nature.
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
বাংলাদেশের অর্থনৈতিক সমীক্ষা ২০২৪ [Bangladesh Economic Review 2024 Bangla.pdf] কম্পিউটার , ট্যাব ও স্মার্ট ফোন ভার্সন সহ সম্পূর্ণ বাংলা ই-বুক বা pdf বই " সুচিপত্র ...বুকমার্ক মেনু 🔖 ও হাইপার লিংক মেনু 📝👆 যুক্ত ..
আমাদের সবার জন্য খুব খুব গুরুত্বপূর্ণ একটি বই ..বিসিএস, ব্যাংক, ইউনিভার্সিটি ভর্তি ও যে কোন প্রতিযোগিতা মূলক পরীক্ষার জন্য এর খুব ইম্পরট্যান্ট একটি বিষয় ...তাছাড়া বাংলাদেশের সাম্প্রতিক যে কোন ডাটা বা তথ্য এই বইতে পাবেন ...
তাই একজন নাগরিক হিসাবে এই তথ্য গুলো আপনার জানা প্রয়োজন ...।
বিসিএস ও ব্যাংক এর লিখিত পরীক্ষা ...+এছাড়া মাধ্যমিক ও উচ্চমাধ্যমিকের স্টুডেন্টদের জন্য অনেক কাজে আসবে ...
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
2. 68 Computer Science & Information Technology (CS & IT)
word गलतियाँ, which when stemmed gives गलतिय, but this is not a proper root. Over-stemming occurs when words that need not be grouped together are nevertheless grouped; it arises mainly with semantically distant words. For example, the Hindi word चहेता when stemmed gives चहे, which again is not a proper root word. Thus a stemmer does not provide the contextual knowledge that a lemmatizer does. English and the other European languages are not highly inflected, so many stemmers and lemmatizers have been developed for them; Indian languages, by contrast, are highly inflected, and lemmatizers for them are rarely found. This paper focuses on Hindi, the official language of India, which is spoken widely in almost all parts of the country. In order to preserve the language and its root words we have developed a lemmatizer containing various rules. The study does not cover all possible rules, but it can be taken as a prototype for extending the functionality of the system. We have attempted to build an automated, rule-based lemmatizer that can be used efficiently for information retrieval.
The paper is organized as follows: the first section gives the introduction and the problems of stemming. The next section discusses the literature in this research area. We then describe the linguistic background of Hindi, after which we present our proposed work, including the methodology and approach used. Next we describe the processing technique, discuss some examples, and note limitations of the system based on the evaluation results. The final section concludes the paper.
2. RELATED WORK
A lot of research work has been done, and is still going on, in the development of stemmers and lemmatizers. The first stemmer was developed by Julie Beth Lovins [1] in 1968. Martin Porter [2] later improved stemming for English in 1980; his algorithm, which automatically removes affixes from English words, remains one of the most widely accepted stemming methods and was originally implemented as a program in BCPL. Much work has been done on lemmatizers for English and other European languages; in contrast, very little has been done on lemmatization for Indian languages.
A rule-based approach proposed by Plisson et al. [3] is one of the most widely accepted lemmatization algorithms. It is based on word endings, where a suffix is removed or added to obtain the normalized form, and it emphasizes two word-lemmatization algorithms: one based on if-then rules and one on the ripple-down-rules approach. The work of Goyal et al. [6] focuses on the development of a morphological analyzer and generator, aimed at a translation system from Hindi to Punjabi. Nikhil K V S [8] built a Hindi derivational analyzer using a supervised approach with an SVM classifier. Jena et al. [9] proposed a morphological analyzer for Oriya using the paradigm approach, classifying Oriya nouns, adjectives and finite verbs through various paradigm tables. Anand Kumar et al. [7] developed an automatic system for the analysis of Tamil morphology, combining a rule-based approach with sequence labeling to capture non-linear relationships among morphological features from the training data.
Chachoo et al. [5] used a tool named Extract v2.0 to develop the orthographic component of the Kashmiri script. Majumder et al. [12] proposed a clustering-based approach for discovering equivalence classes of root words, tested on two languages, French and Bangla. A rule-based approach to stemming in Hindi was proposed by Ramanathan & Rao [11], based on stripping suffixes through rules covering noun, adjective and verb inflections in Hindi. Bharti Akshar et al. [4] presented a detailed study of morphology using the paradigm approach in their work on natural language processing.
3. LINGUISTIC BACKGROUND OF HINDI
Morphemes play a vital role in lemmatization; studying their formation and internal structure is the main way in which morphologists investigate words. Morphology is broadly categorized into two parts: derivational morphology and inflectional morphology. Derivational morphology forms new lexemes from existing ones by adding or deleting affixes. For example, सजा + वट = सजावट, where the word class changes from adjective to noun. Similarly, in English, employ + ee = employee changes the word class from verb to noun. Inflectional morphology produces various inflections of a word without changing its word class. For example, कलम + दान = कलमदान, where both कलम and कलमदान are singular nouns; the word class remains the same. Root forms basically fall under the noun and verb classes, and this knowledge leads us to the paradigm approach. According to Smriti Singh and Vaijayanthi M Sarma [10], the Hindi noun classification system needs only number and case for morphological analysis. Number is either singular or plural; by default a word is taken as singular. Hindi has two cases, direct and oblique; oblique forms show both the case and the number of the word. For example, in लड़क + ◌ा and लड़क + ◌े, the ending ◌ा marks singular whereas ◌े marks plural. There are also gender patterns: words ending with the suffix ◌ी are typically feminine, whereas words ending with ◌ा are typically masculine. For example लड़का, नेता, घोड़ा, कटोरा, बच्चा and many more are masculine, ending with ◌ा, while लड़की, धोबी, पुत्री, कटोरी, बच्ची and many more are feminine, ending with ◌ी. Many words, however, contradict this pattern: the word पानी (water) is masculine although it ends with ◌ी, and the word माला (garland) is feminine although it ends with ◌ा. There are also words from which the suffix must not be removed: for the suffix ◌ा, words such as पिता, माता, बच्चा, कटोरा and नेता do not require stemming; such words must be kept as they are and refrained from being stemmed. Hindi is thus a highly inflected language that needs a deep study of word structure and formation.
4. PROPOSED WORK
In this paper we discuss the creation of a Hindi lemmatizer. The approach is based on the key concept of optimization, which covers both space and time; the lemmatizer discussed here mainly addresses the time-complexity problem. A lemmatizer is typically built using a rule-based approach together with a paradigm approach. In the rule-based approach, in addition to the rules, a knowledgebase is created to store grammatical features, and a further knowledgebase stores the exceptional root words, i.e. roots that must be kept as they are even though they appear to contain a suffix. Although creating the knowledgebase requires a large amount of memory, in terms of time it gives the most accurate and fastest results, because looking up an input word in the knowledgebase takes very little time. The study in [7] shows that Tamil words have an effectively unbounded set of inflections, whereas Hindi words have a finite set of inflections that is quite easy to maintain in a knowledgebase. We have restricted our knowledgebase to commonly used words, excluding proper nouns such as the names of persons and places.
4.1 SUFFIX GENERATION
For the development of the lemmatizer we examined various words with their suffixes and the morphological changes involved; these suffixes and changes led to the formulation of specific rules. For example, the word खराबी (defect) is derived by adding the suffix ◌ी to the word खराब (bad), which transforms the adjective into a noun. Many other words share the same suffix; some are shown in Tables 1 and 2.

Table 1. Examples of words derived with the suffix ◌ी / ई (adjective to noun)

Root Word      Derived Word
साफ़            सफ़ाई
ऊँचा           ऊँचाई
मोटा           मोटाई
गरीब           गरीबी
सर्द            सर्दी
Table 2. Some more suffixes

Root Word      Derived Word      Suffix
गाड़ी           गाड़ियों            ि◌य
मीठा           मिठाई             ई
पवित्र          पवित्रता           ता
जादू           जादूगर            गर
रोशन           रोशनदान           दान
चढ़            चढ़ाई             ◌ाई
Hindi words are large in number, so the extracted suffix list was long, and since the work was done manually this phase was quite time consuming. The suffixes were generated by processing a corpus of 40,000 sentences; about 75 lakh words were manually stemmed, and 124 suffixes were derived from them.
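The suffix extraction described above was done manually, but the underlying idea can be sketched in code: given a pair of an inflected word and its root, the implied rule is whatever follows their longest common prefix. A minimal Python illustration (the word pairs are examples from the tables below, not the full corpus):

```python
def implied_rule(word: str, root: str):
    """Split an (inflected word, root) pair at the longest common prefix,
    returning (suffix stripped from the word, characters added back)."""
    i = 0
    while i < min(len(word), len(root)) and word[i] == root[i]:
        i += 1
    return word[i:], root[i:]

# लड़कों vs लड़का: strip ों, add ◌ा -- the exceptional rule shown in Table 3
print(implied_rule("लड़कों", "लड़का"))
# सड़कों vs सड़क: strip ों, add nothing -- the plain rule
print(implied_rule("सड़कों", "सड़क"))
```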
4.2 RULES GENERATION
After generating the suffix list we developed rules. We created 124 rules, framed so that the suffix is removed from the input word and, if required, a character or maatra is added. For example, consider the suffix ◌ो◌ं; some words containing this suffix are shown in Table 3.
Table 3. Words showing the suffix ◌ो◌ं

Word        Root        Extracted suffix      Added character
लड़कों      लड़का        ◌ो◌ं                  ◌ा
सड़कों      सड़क         ◌ो◌ं                  —
लेखकों      लेखक         ◌ो◌ं                  —
In the above table, removing the suffix ◌ो◌ं yields the respective root word, but the word लड़कों is an exception: after removing ◌ो◌ं we must add ◌ा to the last letter to obtain the genuine root word लड़का. There are many other rules of this kind, which remove a suffix and, where necessary, also add a character. For example, the rule for extracting the suffix ि◌य is shown in Table 4. By a rule of Hindi grammar, when this plural suffix is removed, ◌ी must be added to the last letter of the word. Table 4 shows the general rule of removing ि◌य and adding ◌ी, together with some exceptions in which ि◌ is added instead of ◌ी. The last word in the table, चिड़ियों, is a further exception whose root form is चिड़िया: it contains two suffixes together, ि◌य and ◌ो◌ं, which makes it hard for the system to pick the correct rule for the word. There are many more such exceptions, for which we generated different rules, and to handle the remaining problem cases we built a database in which such exceptional words are kept. Although this work took considerable time, the approach was adopted for the sake of fast and accurate results.
Table 4. Words showing the suffix ि◌य

Word          Root         Extracted suffix      Added character
लड़कियों      लड़की         ि◌य                   ◌ी
कहानियों      कहानी         ि◌य                   ◌ी
कवियों        कवि           ि◌य                   ि◌ (exception)
चिड़ियों      चिड़िया        ि◌य                   ि◌या (exception)
5. PROCESSING TECHNIQUE
Fig 1. Schematic diagram of the system
The foremost step is to read the input word. A database contains all the root words; the input word is first looked up there, and if it is present it is displayed as it is. If the word is not present in the database, the system proceeds to the rules; after a rule is applied, the root word is generated and displayed. The procedure is as follows:
if (word) is present in (root list)
{
    fetch the root from the list;
    display the root;
}
else
{
    if (word) ends with (suffix)
    {
        take the substring of the word, removing the suffix;
        display the root;
    }
}
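As a concrete illustration, this lookup-then-rules flow can be sketched in Python. The root list, exception table and rules below are small hypothetical samples, not the system's actual knowledgebase of 124 rules:

```python
# Sketch of the lemmatizer's processing flow:
# 1) exceptional roots stored in the knowledgebase are returned directly;
# 2) word-level exceptions are resolved from the exception database;
# 3) otherwise suffix rules are tried, longest suffix first, stripping the
#    suffix and appending the replacement maatra if the rule requires one.

ROOT_LIST = {"पानी", "माला", "नेता"}        # roots kept as-is despite their endings
EXCEPTIONS = {"लड़कों": "लड़का"}             # word-level exceptions to the general rules
RULES = [                                   # (suffix, replacement), longest suffix first
    ("ियों", "ी"),                           # e.g. कहानियों -> कहानी
    ("ों", ""),                              # e.g. सड़कों -> सड़क
]

def lemmatize(word: str) -> str:
    if word in ROOT_LIST:                   # knowledgebase lookup first
        return word
    if word in EXCEPTIONS:                  # then the exception database
        return EXCEPTIONS[word]
    for suffix, replacement in RULES:       # then the general rules
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word                             # no rule matched: return unchanged

print(lemmatize("कहानियों"))
print(lemmatize("लड़कों"))
print(lemmatize("पानी"))
```

Ordering the rules longest-suffix-first mirrors the handling of words that carry two suffixes together, while the exception table plays the role of the database of exceptional words described above.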
6. ILLUSTRATION
As an illustration, we gave the system a set of Hindi words and analyzed the output. Some of them are shown in Table 5.
Table 5. Stemmed output

Input           Output
चिड़ियों        चिड़िया
लड़कियों        लड़की
भारतीयता        भारत
प्रतिभाशाली      प्रतिभा
गौरवान्वित       गौरव
7. EVALUATION
The system was evaluated for accuracy on 2500 words. Of these, 2227 words were lemmatized correctly and 273 were incorrect, because they violated both the exceptional and the general rules.
Fig. 2. Graphical representation of the evaluation
Our system gave 89.08% accuracy, calculated using the following formula:
Accuracy (%) = (Total correct lemmas / Total words) × 100
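With the figures reported above, the computation works out as follows:

```python
correct, total = 2227, 2500
accuracy = correct / total * 100        # proportion of correct lemmas, as a percentage
print(f"Accuracy = {accuracy:.2f}%")    # Accuracy = 89.08%
```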
8. CONCLUSION
In this paper we have discussed the development of a Hindi lemmatizer. The work follows a rule-based approach combined with a paradigm approach, for which we created a knowledgebase containing the Hindi root words commonly used in day-to-day life. The main emphasis is on the time-optimization problem rather than on space; since space is no longer a major constraint, our approach aims to optimize time and to generate accurate results in a very short period. Our system gave 89.08% accuracy.
REFERENCES
[1] Julie Beth Lovins, Development of a Stemming Algorithm, Mechanical Translation and Computational Linguistics, Vol. 11, No. 1, pp. 22-31, 1968.
[2] Martin F. Porter, An Algorithm for Suffix Stripping, Program, Vol. 14, No. 3, pp. 130-137, 1980.
[3] J. Plisson, N. Lavrac, D. Mladenic, A Rule Based Approach to Word Lemmatization, Proceedings of the 7th International Multiconference Information Society (IS-2004), Institut Jozef Stefan, Ljubljana, pp. 83-86, 2004.
[4] Bharti Akshar, Vineet Chaitanya, Rajeev Sangal, Natural Language Processing: A Paninian Perspective, Prentice-Hall of India, 1995.
[5] Manzoor Ahmad Chachoo, S.M.K. Quadri, Morphological Analysis from the Raw Kashmiri Corpus Using the Open Source Extract Tool, Vol. 7, No. 2, 2011.
[6] Vishal Goyal, Gurpreet Singh Lehal, Hindi Morphological Analyzer and Generator, IEEE Computer Society Press, California, USA, pp. 1156-1159, 2008.
[7] Anand Kumar M, Dhanalakshmi V, Soman K P, A Sequence Labeling Approach to Morphological Analyzer for Tamil Language, International Journal on Computer Science and Engineering, Vol. 2, No. 6, 2010.
[8] Nikhil K V S, Hindi Derivational Morphological Analyzer, Language Technologies Research Center, IIIT Hyderabad, 2012.
[9] Itisree Jena, Sriram Chaudhary, Himani Chaudhary, Dipti M. Sharma, Developing Oriya Morphological Analyzer Using Lt-toolbox, ICISIL 2011, CCIS 139, pp. 124-129, 2011.
[10] Smriti Singh, Vaijayanthi M Sarma, Hindi Noun Inflection and Distributed Morphology.
[11] A. Ramanathan, D. D. Rao, A Lightweight Stemmer for Hindi, Proceedings of the Workshop on Computational Linguistics for South Asian Languages, 10th Conference of the European Chapter of the Association for Computational Linguistics, pp. 42-48, 2003.
[12] Prasenjit Majumder, Mandar Mitra, Swapan K. Parui, Gobinda Kole, Pabitra Mitra, Kalyankumar Datta, YASS: Yet Another Suffix Stripper, ACM Transactions on Information Systems, Vol. 25, No. 4, pp. 18-38, 2007.