This document provides an overview of basic probability concepts and statistical methods. It discusses probability as it relates to outcomes and events, and as the tool statistics uses to make inferences about a population from a random sample. It then covers n-gram language models, which use the previous n-1 words to predict the next word, and collocations. Finally, it summarizes part-of-speech tagging methods, including rule-based, supervised stochastic, and unsupervised approaches, and lists freely available POS taggers for various languages.
1. Basic concepts of Probability
and Statistics
Thennarasu Sakkan
Department of Linguistics
Central University of Kerala
2. A probability provides a quantitative description of the chances or likelihoods associated with various outcomes.
Probability is the tool that statistical methods use in order to make inferences about the characteristics of a population given a random sample of data.
Understanding probability is therefore key to understanding statistics.
3. The probability of an event A:
P(A) = N_A / N
where N is the number of possible outcomes of the random experiment and N_A is the number of outcomes favourable to the event A.
For example, a 6-sided die has 6 possible outcomes, 3 of which are even, and thus
P(even) = 3/6 = 1/2
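As a small illustrative sketch (not part of the original slides), the same calculation can be written in Python; the outcome set and the event are the only assumptions:

from fractions import Fraction

# Classical probability: P(A) = N_A / N
outcomes = [1, 2, 3, 4, 5, 6]                      # all outcomes of a fair six-sided die
favourable = [o for o in outcomes if o % 2 == 0]   # outcomes favourable to the event "even"
p_even = Fraction(len(favourable), len(outcomes))
print(p_even)                                      # 1/2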
4. Probability theory is a formal way of representing probabilistic concepts and describing uncertain events.
A probability is a mapping from the set of events, or sample space, into the interval [0, 1].
Intuitively, the probability of a particular event or set of events is the fraction of the time that the event or set of events occurs.
Thus, a probability mapping goes from the set of all possible events to their respective probabilities of occurring.
5. Probability’s empirical counterparts are proportions
(between 0 and 1) and percentages (between 0 and
100).
Since something must always occur, probabilities
always add up to 1 (as long as all possible events are
included in the sum).
Since no one event can happen less than 0% of the
time or more than 100% of the time, an individual
probability must be between 0 and 1.
6. LANGUAGE MODEL
Language modelling refers to the task of modelling a language using probabilities.
A language model is one of the important components of statistical machine translation.
This component accounts for the fluency of the given language, i.e. it quantifies how probable a given sentence is, assigning high probability to plausible sentences.
7. A language model gives no guarantee about the syntax or semantics of the language being modelled.
An n-gram is a contiguous sequence of n items from a given sequence of text.
Let us start with word prediction using simple n-grams.
Our goal is to calculate the probability of a word w given some history h, or mathematically Pr(w|h).
The n-gram model is a widely used language-modelling tool, found crucial in applications such as speech recognition, spelling correction, word prediction, POS tagging, natural language generation and word similarity.
8. An n-gram model
An n-gram model is a type of probabilistic model for predicting the next item in a text sequence.
n-grams are used in various areas of statistical natural language processing and genetic sequence analysis.
The model uses the previous n-1 words in a sequence to predict the next word.
The items in question can be phonemes, syllables, letters, words or base pairs according to the application.
9. N-gram models can be imagined as placing a small
window over a sentence or a text, in which only n
words are visible at the same time.
The simplest n-gram model is therefore a so-called
unigram model.
This is a model in which we only look at one word at
a time.
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram".
14. Collocations
The notion of collocation has been used in lexicography since the 19th century.
What is a collocation?
A collocation is a pair or group of words that are often used together.
These combinations sound natural to native speakers, but learners of the language have to make a special effort to learn them because they are often difficult to guess.
15. A straightforward application of bigrams is the identification of so-called collocations.
Recall that bigram language models exploit the observation that words do not simply combine in any random order; that is, word order is constrained by grammatical structure (e.g. phrases).
However, some combinations of words are subject to an additional kind of constraint.
16. Such combinations are commonly known as collocations.
– Examples of collocations are:
• United States
• vice president
• chief executive, chief officer, etc.
Corpus linguists study such collocations to answer interesting questions about the combinatory properties of words.
Collocations are a feature of natural languages that is not well addressed by current language teaching or by the current models used for NLP.
17. According to Benson et al., there are two types of collocations: (i) lexical and (ii) grammatical collocations.
(i) Lexical collocations, such as
noun + noun,
adjective + noun;
(ii) grammatical collocations, such as
noun + suffixes, etc.
29. How do we generate collocations from a corpus text?
The goal is to extract a list of collocations in current use, as sketched below.
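One possible sketch uses NLTK's collocation finder; the corpus and the frequency threshold here are assumptions chosen only for illustration:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download("genesis")  # one-time download of the sample corpus
words = nltk.corpus.genesis.words("english-web.txt")

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)            # ignore bigrams seen fewer than 3 times
print(finder.nbest(measures.pmi, 10))  # 10 bigrams with the highest pointwise mutual information

Any tokenised word list can be passed to from_words, so the same few lines work for one's own corpus.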
30. POS tagging and approaches
Part-of-speech (POS) tagging is the process of assigning a part-of-speech category to each word in a text.
POS tagging is considered to be an important process in speech recognition, natural language parsing, morphological parsing, information retrieval and machine translation.
An automatic part-of-speech tagger can also help in building automatic word-sense disambiguation algorithms.
31. Part-of-speech tags are very often used for shallow parsing of texts, or for finding noun phrases and other phrases for information extraction applications.
Corpora that have been marked for part of speech are very useful for linguistic research, for example, to find the frequencies of a particular word or of particular sentence constructions in large corpora.
Apart from these, many Natural Language Processing (NLP) activities such as summarization, Natural Language Understanding (NLU) and Question Answering (QA) systems depend on part-of-speech tagging.
32. Approaches to POS Tagging
POS taggers are broadly classified into three categories: rule-based, empirical and hybrid.
In the rule-based approach, hand-written rules are used to resolve tag ambiguity.
Empirical POS taggers are further classified into example-based and stochastic taggers.
33. Stochastic taggers are either HMM-based, choosing the tag sequence which maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features.
Stochastic taggers are further classified into supervised and unsupervised taggers.
Each of these supervised and unsupervised taggers is categorized into different groups as below:
35. Classification of POS tagging models
Supervised
– Rule-based (e.g. Brill)
– Stochastic (N-gram based, Maximum Likelihood, Hidden Markov Model)
– Neural
Unsupervised
– Rule-based (e.g. Brill)
– Stochastic (Baum-Welch algorithm, Viterbi algorithm)
– Neural
36. Rule-based taggers generally involve a large database of hand-written disambiguation rules, for example, a rule specifying that an ambiguous word is a noun rather than a verb if it follows a determiner.
Among rule-based part-of-speech taggers, the one built by Brill has the advantage of learning tagging rules automatically.
Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.
37. Supervised POS tagging
The supervised POS tagging models require pre-tagged corpora, which are used during training to learn rule sets, information about the tagset, word-tag frequencies, etc.
The learning tool generates trained models along with the statistical information.
The performance of the models generally increases with the size of the pre-tagged corpus.
38. Unsupervised POS tagging
Unlike the supervised models, the unsupervised POS tagging models do not require a pre-tagged corpus.
Instead, they use advanced computational methods like the Baum-Welch algorithm to automatically induce tagsets, transformation rules, etc.
Based on this information, they either calculate the probabilistic information needed by stochastic taggers or induce the contextual rules needed by rule-based or transformation-based systems.
39. Rule-based POS tagging
The rule-based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to words in a sentence.
These rules are often known as context frame rules. For example, a context frame rule might say something like:
"If an ambiguous/unknown word X is preceded by a Determiner and followed by a Noun, tag it as an Adjective."
The transformation-based approaches, on the other hand, use a pre-defined set of handcrafted rules as well as automatically induced rules that are generated during training.
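The context frame rule quoted above can be sketched in a few lines of Python; the rule, the tag names and the tiny lexicon are illustrative only, not a real tagger:

# Toy context-frame rule: if an unknown word is preceded by a determiner
# and followed by a noun, tag it as an adjective.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "house": "NOUN"}

def tag(words):
    tags = [LEXICON.get(w, "UNK") for w in words]
    for i, t in enumerate(tags):
        if (t == "UNK" and 0 < i < len(tags) - 1
                and tags[i - 1] == "DET" and tags[i + 1] == "NOUN"):
            tags[i] = "ADJ"   # the context-frame rule fires
    return list(zip(words, tags))

print(tag(["the", "shaggy", "dog"]))
# [('the', 'DET'), ('shaggy', 'ADJ'), ('dog', 'NOUN')]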
40. Some models also use information about capitalization and punctuation, the usefulness of which is largely dependent on the language being tagged.
The earliest algorithms for automatically assigning part of speech were based on a two-stage architecture [Harris Z. S., 1962].
The first stage used a dictionary to assign each word a list of potential parts of speech.
The second stage used large lists of hand-written disambiguation rules to narrow this list down to a single part of speech for each word.
41. The ENGTWOL [Voutilainen Atro, 1995] tagger is based on the same two-stage architecture, although both the lexicon and the disambiguation rules are much more sophisticated than in the early algorithms.
The ENGTWOL lexicon is based on two-level morphology.
It has about 56,000 entries for English word stems, counting a word with multiple parts of speech (e.g. nominal and verbal senses of hit) as separate entries, and of course not counting inflected and many derived forms.
Each entry is annotated with a set of morphological and syntactic features. In the first stage of the tagger, each word is run through the two-level lexicon transducer and the entries for all possible parts of speech are returned.
43. Stochastic POS tagging
A stochastic approach makes use of frequencies, probabilities or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text.
The problem with this approach is that it can produce sequences of tags for sentences that are not acceptable according to the grammar rules of a language.
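NLTK's UnigramTagger implements exactly this most-frequent-tag baseline; a minimal sketch, assuming nltk and its Penn Treebank sample are installed:

import nltk
from nltk.corpus import treebank

# nltk.download("treebank")  # one-time download of the tagged sample
tagged_sents = treebank.tagged_sents()
train, test = tagged_sents[:3000], tagged_sents[3000:]

tagger = nltk.UnigramTagger(train)        # most frequent tag per word in the training data
print(tagger.tag(["the", "cat", "sat"]))  # words unseen in training receive the tag None
print(tagger.evaluate(test))              # per-word accuracy of the baseline on held-out data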
44. An alternative to the word frequency approach is the n-gram approach, which calculates the probability of a given sequence of tags.
It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where the value of n is set to 1, 2 or 3 for practical purposes. These are known as the unigram, bigram and trigram models.
The most common algorithm for implementing an n-gram approach for tagging new text is the Viterbi algorithm, a search algorithm that avoids the polynomial expansion of a breadth-first search by trimming the search tree at each level, using the best m Maximum Likelihood Estimates (MLE), where m represents the number of tags of the following word.
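A compact sketch of Viterbi decoding for a bigram HMM tagger follows; the tiny hand-made transition and emission tables are illustrative, not trained from a corpus:

# Viterbi decoding for a bigram HMM: states are tags, observations are words.
tags = ["DET", "NOUN", "VERB"]
trans = {("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
         ("DET", "NOUN"): 0.9, ("DET", "DET"): 0.05, ("DET", "VERB"): 0.05,
         ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
         ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1}
emit = {("DET", "the"): 0.7, ("NOUN", "dog"): 0.4,
        ("VERB", "barks"): 0.3, ("NOUN", "barks"): 0.01}

def viterbi(words):
    # best[t] = (probability of the best tag path ending in tag t, that path)
    best = {t: (trans.get(("<s>", t), 0) * emit.get((t, words[0]), 0), [t])
            for t in tags}
    for w in words[1:]:
        # Keep, for each tag, only the highest-probability extension of a kept path.
        best = {t: max(((p * trans.get((prev, t), 0) * emit.get((t, w), 0),
                         path + [t])
                        for prev, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])

print(viterbi(["the", "dog", "barks"]))
# (about 0.0272, ['DET', 'NOUN', 'VERB'])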
45. Advantages of the statistical approach
• Very robust: can process any input string
• Training is automatic and very fast
• Can be retrained for different corpora/tagsets without much effort
• Language independent
• Minimizes human effort and human error
http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/viterbi_algorithm/s1_pg1.html
46. Apart from these, quite a few other approaches to tagging have been developed.
Support Vector Machines: a powerful machine learning method used for various applications in NLP and other areas like bioinformatics, data mining, etc.
Neural Networks: potential candidates for the classification task, since they learn abstractions from examples [Schmid H, 1994].
Decision Trees: a decision tree is a decision support tool that uses a tree-like graph; it is one way to represent an algorithm.
47. Decision trees are classification devices based on hierarchical clusters of questions. They have been used for natural language processing tasks such as POS tagging [Schmid H, 1994].
The software "Weka" can be used to classify the ambiguous words.
48. Maximum Entropy Models: These avoid certain
problems of statistical interdependence and have
proven successful for tasks such as parsing and
POS tagging.
Example-Based Techniques: These techniques find
the training instance that is most similar to the
current problem instance and assume the same
class for the new problem instance as for the
similar one.
49. Freely downloadable part-of-speech taggers for English and other languages
Stanford POS tagger
A log-linear tagger in Java (by Kristina Toutanova).
hunpos
An HMM tagger with models available for English and Hungarian; a reimplementation of TnT in OCaml, with pre-compiled models. Runs on Linux, Mac OS X, and Windows.
MBT: Memory-based Tagger
Based on TiMBL.
TreeTagger
http://nlp.stanford.edu/links/statnlp.html
50. • A decision tree based tagger from the University of Stuttgart. It is language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions; binary distribution only.) The page has links to sites where one can run it online.
51. SVMTool
A POS tagger based on SVMs (uses SVMlight). LGPL.
ACOPOST (formerly ICOPOST)
Open-source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under the GNU public license.
MXPOST
Adwait Ratnaparkhi's maximum entropy part-of-speech tagger, a Java POS tagger.
A sentence boundary detector (MXTERMINATOR) is also included. The original version was only for JDK 1.1; a later version works with JDK 1.3+. Class files, not source.
52. fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
mu-TBL
An implementation of a Transformation-Based Learner (a la Brill) by Torbjörn Lager, usable for POS tagging and other things. A web demo is also available.
YamCha
An SVM-based NP chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won the CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
53. QTAG part-of-speech tagger
An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
The TOSCA/LOB tagger
Currently available for MS-DOS only, but the decision to make this famous system available is very interesting from a historical perspective, and for software sharing in academia more generally. LOB tag set.
Brill's transformation-based learning tagger
A symbolic tagger, written in C. It is no longer available from a canonical location, but one may find a version via the Wikipedia page, or one can try a reimplementation such as fnTBL.
54. • Original Xerox Tagger
A Common Lisp HMM tagger available by FTP.
Lingua-EN-Tagger
A Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
55. Development of POS Annotated Corpora
Corpus linguistics seeks to further the understanding of language through the analysis of large quantities of naturally occurring data.
Text corpora are used in a number of different ways.
Traditionally, corpora have been used for the study and analysis of language at different levels of linguistic description.
Corpora have been constructed for the specific purpose of acquiring knowledge for information extraction systems, knowledge-based systems and e-business systems.
Corpora have also been used for studying child language development, and speech corpora play a vital role in the specification, design and implementation of telephonic communication systems and for the broadcast media.
56. There is a long tradition of corpus linguistic studies in Europe. The needs that a corpus serves for a language are multifarious.
Starting from the preparation of a dictionary or lexicon through to machine translation, the corpus has become an inevitable resource for the technological development of languages.
A corpus is a huge body of text incorporating various types of textual material, including newspapers, weeklies, fiction, scientific writing, literary writing, and so on.
A corpus represents all the styles of a language. A corpus must be very large, as it is going to be used for many language applications, such as the preparation of lexicons of different sizes, purposes and types, NLP tools, machine translation programs and so on.
57. Corpora can be distinguished as tagged corpora, parallel
corpora and aligned corpora.
A tagged corpus is one annotated for part-of-speech,
morphology, lemma, phrases, etc.
A parallel corpus contains texts and their translations in each
of the languages involved. It allows wider scope for double-
checking translation equivalents.
An aligned corpus is a kind of bilingual corpus where text
samples of one language and their translations into another
language are aligned, sentence by sentence, phrase by
phrase, word by word, or even character by character.
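To make these distinctions concrete, here is a minimal, hypothetical
Python sketch of how a tagged corpus and a sentence-aligned parallel
corpus might be represented in memory; the sentences, tags and
translations are illustrative only, not drawn from any actual corpus.

# Hypothetical in-memory representations of two of the corpus types above.

# Tagged corpus: each sentence is a list of (word, POS-tag) pairs.
tagged_corpus = [
    [("pUkkaLai", "NN"), ("paRiththAn", "VF")],   # illustrative Tamil sentence
]

# Sentence-aligned parallel corpus: (source, target) sentence pairs.
aligned_corpus = [
    ("avan pUkkaLai paRiththAn", "he plucked the flowers"),  # illustrative pair
]

for src, tgt in aligned_corpus:
    print(src, "<->", tgt)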
58. Applications of POS tagged corpus
A POS-tagged corpus is used in the following tasks:
– Chunking
– Parsing
– Information extraction and retrieval
– Tree bank creation
– Document classification
– Question answering
59. Applications of POS tagged corpus cont…
– Automatic dialogue system
– Speech processing
– Summarization
– Statistical training of Language models
– Machine Translation using multilingual corpora
– Text checkers for evaluating spelling and grammar
– Computer Lexicography
– Educational applications like Computer-Assisted
Language Learning
60. Complexity in Dravidian POS tagging
As the Dravidian languages are agglutinative, nouns are
inflected for number and case, while verbs are inflected for
tense, person, number and gender.
Verbs are adjectivalized and adverbialized, and verbs
and adjectives are nominalized by means of certain
nominalizers. Adjectives and adverbs do not inflect.
Many postpositions in Tamil [Arden 1942; Rajendran S,
2007] derive from nominal and verbal sources, so one
often has to depend on syntactic function or context to
decide whether a given word is a noun, adjective, adverb
or postposition.
61. This makes POS tagging of Tamil complex.
Root ambiguity
A root word can be ambiguous: it can have more than one
sense, and sometimes a root belongs to more than one POS
category.
Though the POS can often be disambiguated using contextual
information such as co-occurring morphemes, this is not
always possible.
These issues should be taken care of when POS taggers are
built for the Tamil language.
For example, Tamil root words like adi, padi, isai, mudi and
kudi can take both the noun and the verb category, which
leads to the root ambiguity problem in POS tagging.
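As a toy illustration of root ambiguity, the Python sketch below keeps
a small lexicon of ambiguous roots and resolves them with a naive
contextual rule; the lexicon, the suffix set and the rule are all
invented for illustration and do not constitute a real Tamil tagger.

# Toy illustration of root ambiguity; all data and rules are invented.
AMBIGUOUS_ROOTS = {
    "adi": {"NN", "VB"}, "padi": {"NN", "VB"}, "isai": {"NN", "VB"},
    "mudi": {"NN", "VB"}, "kudi": {"NN", "VB"},
}

def candidate_tags(root):
    """Return every POS category the root may take."""
    return AMBIGUOUS_ROOTS.get(root, {"NN"})

def disambiguate(root, next_morpheme):
    """Naive rule: a tense-like morpheme after the root suggests a verb."""
    tags = candidate_tags(root)
    if "VB" in tags and next_morpheme in {"thth", "nth", "kkir"}:  # toy tense markers
        return "VB"
    return "NN" if "NN" in tags else tags.pop()

print(disambiguate("adi", "thth"))  # VB: verbal reading in this context
print(disambiguate("adi", None))    # NN: default nominal reading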
62. Noun complexity
Nouns are words which denote a person, place, thing,
time, etc. In Tamil, nouns are inflected for number and
case at the morphological level.
Morphological-level inflection:
Noun (+ number) (+ case)
Example: pUk-kaL-ai <NN>
flower-plural-accusative case suffix
Noun (+ number) (+ oblique) (+ euphonic) (+ case)
Example: pUk-kaL-in-Al <NN>
flower-plural-oblique/euphonic suffix-instrumental case suffix
(a minimal segmentation sketch follows this slide).
Nouns further need to be annotated as common noun,
compound noun, proper noun, compound proper noun,
pronoun, cardinal and ordinal.
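The noun templates above invite a simple right-to-left suffix-stripping
segmentation. The Python sketch below works only with tiny, invented
suffix inventories; a real analyzer would need a full suffix lexicon
and sandhi handling.

# Minimal suffix-stripping sketch for the templates
# Noun (+ number) (+ oblique/euphonic) (+ case); suffix lists are illustrative.
PLURAL  = ["kaL"]
OBLIQUE = ["in"]                     # oblique/euphonic increment
CASE    = ["ai", "Al", "ku", "il"]

def strip_one(word, suffixes, label):
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            return word[: -len(s)], (s, label)
    return word, None

def segment_noun(word):
    """Peel suffixes off right to left: case, then oblique, then plural."""
    parts = []
    for suffixes, label in [(CASE, "case"), (OBLIQUE, "oblique"), (PLURAL, "plural")]:
        word, found = strip_one(word, suffixes, label)
        if found:
            parts.append(found)
    return word, list(reversed(parts))

print(segment_noun("pUkkaLai"))    # ('pUk', [('kaL', 'plural'), ('ai', 'case')])
print(segment_noun("pUkkaLinAl"))  # ('pUk', [('kaL', 'plural'), ('in', 'oblique'), ('Al', 'case')])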
63. Pronouns need to be further annotated for personal
pronouns.
Complexity arises between common noun and compound
noun, and also between proper noun and compound proper
noun. A common noun can also occur as a compound noun,
for example
UrAdci <NNC> thalaivar <NNC>
When UrAdci and thalaivar come together they form a
compound noun (<NNC>), but when UrAdci and thalaivar
occur separately in a sentence each should be tagged as a
common noun (<NN>). Such complexity also occurs with the
proper noun (<NNP>) and compound proper noun (<NNPC>).
Moreover, complexity arises between noun and adverb, and
between pronoun and emphasis, at the syntactic level.
64. Verb complexity
Verbal forms are complex in Tamil. A finite verb
shows the following morphological structure:
Verb stem + tense + person-number-gender
Example: nada + nth + En <VF>
'I walked'
(a toy generation sketch of this template follows the next slide).
A number of non-finite forms are also possible: adverbial
forms, adjectival forms, infinitive forms and conditional forms.
Verb stem + adverbial participle
Example: cey + thu = ceythu <VNAV>
'having done'
65. Verb stem + relative participle
Example: cey + tha = ceytha <VNAJ>
'who did'
Verb stem + infinitive suffix
Example: azu + a = aza <VINT>
'to weep'
Verb stem + conditional suffix
Example: kEL + d + Al = kEddAl <CVB>
'if asked'
A distinction needs to be made between a main verb followed
by another main verb and a main verb followed by an auxiliary
verb.
A main verb followed by an auxiliary verb needs to be
interpreted together with it, whereas a main verb followed by
a main verb needs to be interpreted separately. This leads to
functional ambiguity.
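As a concrete illustration of the finite-verb template above
(verb stem + tense + person-number-gender suffix), here is a minimal
Python generation sketch; the suffix tables are tiny and invented, and
the morphophonemic (sandhi) changes that occur at real morpheme
boundaries are ignored.

# Toy generator for the finite-verb template: stem + tense + PNG suffix.
# Suffix tables are illustrative only; sandhi rules are ignored.
TENSE = {"past": "nth", "present": "kkir", "future": "pp"}
PNG   = {"1sg": "En", "3sg.m": "An", "3sg.f": "AL"}

def finite_verb(stem, tense, png):
    """Concatenate stem + tense suffix + PNG suffix (no sandhi)."""
    return stem + TENSE[tense] + PNG[png]

print(finite_verb("nada", "past", "1sg"))    # nadanthEn  'I walked'
print(finite_verb("nada", "past", "3sg.m"))  # nadanthAn  'he walked'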
66. Developing part-of-speech taggers for
Indian languages
For Bengali, Sandipan et al. (2007) developed a
corpus-based semi-supervised learning algorithm for POS
tagging based on HMMs.
Their system uses a small tagged corpus (500 sentences) and a
large unannotated corpus, along with a Bengali morphological
analyzer. When tested on a corpus of 100 sentences (1,003
words), their system obtained an accuracy of 95%.
67. Smriti Singh et al. (2006) proposed a tagger for Hindi that
uses the affix information stored in a word and assigns a
POS tag using no contextual information. By considering
the previous and the next word in the verb group (VG), it
correctly identifies the main verb and the auxiliaries.
Lexicon lookup was used for identifying the other POS
categories.
In the NLPAI ML contest, Dalal et al. (2006) achieved
accuracies of 82.22% and 82.4% for Hindi POS tagging and
chunking respectively, using maximum entropy models.
Karthik et al. (2006) obtained 81.59% accuracy for Telugu POS
tagging using HMMs.
Sivaji et al. (2006) presented a rule-based chunker for Bengali
which gave an accuracy of 81.64%. The training data for all
three languages contained approximately 20,000 words and the
testing data approximately 5,000 words.
68. For Telugu, three POS taggers have been proposed using
different POS tagging approaches, viz. (1) a rule-based
approach, (2) the transformation-based learning (TBL)
approach of Eric Brill, and (3) a maximum entropy model, a
machine learning technique [Ramasree, R.J. and Kusuma
Kumari, P., 2007].
A Hidden Markov Model (HMM) based tagger for Hindi was
proposed by Manish Shrivastava and Pushpak Bhattacharyya
(2008). The authors attempted to utilize the morphological
richness of the language without resorting to complex and
expensive analysis. The core idea of their approach was to
'explode' the input in order to increase the length of the input
and to reduce the number of unique types encountered during
learning. This in turn increases the probability score of the
correct choice while simultaneously decreasing the ambiguity
of the choices at each stage.
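A rough sketch of this 'exploding' idea is given below; the suffix list
and the splitting rule are invented for illustration and are not the
actual method of Shrivastava and Bhattacharyya (2008). The point is
only that splitting words into stem and suffix tokens shrinks the set
of unique types the learner must estimate.

# Hedged sketch of "exploding" input for a morphologically rich language.
SUFFIXES = ["kaLai", "kaL", "ai", "Al", "il"]   # illustrative suffix list

def explode(word):
    """Split off a known suffix so stem and suffix become separate tokens."""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            return [word[: -len(s)], "+" + s]
    return [word]

sentence = ["avan", "pUkkaLai", "paRiththAn"]
exploded = [token for word in sentence for token in explode(word)]
print(exploded)  # ['avan', 'pUk', '+kaLai', 'paRiththAn']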
69. A stochastic Hidden Markov Model (HMM) based part-of-
speech tagger has been proposed for Malayalam. To perform
POS tagging using a stochastic approach, an annotated
corpus is needed. Due to the non-availability of an annotated
corpus, a morphological analyzer was also developed to
generate a tagged corpus from the training set [Manju K. et
al., 2009].
Various methodologies have been developed for POS tagging
of Tamil. A rule-based POS tagger for Tamil was
developed and tested [Arulmozhi et al., 2004]. This system
gives only the major tags; the sub-tags are overlooked
during evaluation. A hybrid POS tagger for Tamil using the
HMM technique together with a rule-based system was also
developed [Arulmozhi P and Sobha L, 2006].
70. Lakshmana Pandian S and Geetha T V (2008) developed a
morpheme-based language model for Tamil part-of-speech
tagging. The language model categorizes a word's part of
speech based on information about its stem type, its last
morpheme and the morpheme before the last. For estimating
the contribution factors of the model, they followed the
generalized iterative scaling technique.
Dhanalakshmi et al. (2008) proposed an SVM-based tagger
using linear programming and developed their own POS tagset
for Tamil, which has 32 tags. They used this tagset to annotate
their corpus, trained their model, and reported an accuracy of
95.63%. Dhanalakshmi et al. (2009) also proposed another
tagger in which machine learning techniques were used to
extract linguistic information, which was then used to train an
SVM-based tagger. They used their own 32-tag tagset for
annotating the corpus and reported an accuracy of 95.64%.
71. Considerable effort has also been put into developing POS
taggers for other Indian languages. For Malayalam, an
HMM-based tagger was proposed by Manju et al.; since
they did not have an annotated corpus, they used a
morphological analyzer to generate the corpus, which was
then used for training the HMM algorithm. Another tagger
for Malayalam was developed by Anthony et al. [2009], who
used Support Vector Machines (SVMs). For tagging they used
SVMTool, which was developed by Giménez and Màrquez.
For developing this tagger, Anthony et al. first proposed a
tagset which they claim is suitable for Malayalam, and then
created an annotated corpus using this tagset. Their tagger
reported 94% accuracy with this tagset.
72. Word Sense Disambiguation
• Word sense disambiguation (WSD) is the ability to
identify the meaning of words in context in a
computational manner. WSD is considered an AI-
complete problem, that is, a task whose solution is at least
as hard as the most difficult problems in artificial
intelligence.
A striking feature of Natural Language is that many words
and sentences have more than one meaning (i.e. are
semantically ambiguous), and which meaning is correct
depends on the context. This problem arises at several
levels.
73. There are problems at the level of individual words. Consider
this example:
The man went to the bank.
What kind of 'bank'? A river bank, a source of money, or a
blood bank? Here we have three distinct English words with
the same spelling and pronunciation.
Word sense disambiguation (WSD) is the problem of
determining in which sense a word having a number of distinct
senses is used in a given sentence. So WSD is the task of
removing the ambiguity of a word in context.
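A classical baseline for WSD is the simplified Lesk algorithm: choose
the sense whose dictionary gloss shares the most words with the
sentence containing the ambiguous word. Below is a minimal Python
sketch; the glosses for 'bank' are invented for illustration rather
than taken from a real dictionary.

# Simplified Lesk-style WSD sketch; glosses are invented for illustration.
SENSES = {
    "bank/finance": "an institution where people deposit and borrow money",
    "bank/river":   "the sloping land alongside a river or stream",
    "bank/blood":   "a place where blood is stored for transfusion",
}

def lesk(context_sentence):
    """Pick the sense whose gloss overlaps most with the context words."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(lesk("he sat on the bank of the river and watched the stream"))
# -> bank/river ('the', 'river' and 'stream' overlap with its gloss)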