In this paper we present fundamental lexical semantics of the Sinhala language and a Hidden Markov Model (HMM) based Part of Speech (POS) tagger for Sinhala. In any natural language processing task, part-of-speech tagging is a vital step: it involves analysing the construction, behaviour and dynamics of the language, knowledge that can be utilized in computational linguistics analysis and automation applications. Because Sinhala is a morphologically rich and agglutinative language, in which words are inflected with various grammatical features, tagging is essential for further analysis of the language. Our research follows a statistical approach, in which tagging is done by computing the tag-sequence probability and the word-likelihood probability from a given corpus, with the linguistic knowledge extracted automatically from the annotated corpus. The current tagger reaches an accuracy of more than 90% for known words.
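The scheme described above, combining tag-sequence (transition) probabilities with word-likelihood (emission) probabilities estimated from an annotated corpus, can be sketched with a tiny Viterbi decoder. The toy English tag set and training sentences below are illustrative stand-ins, not the paper's Sinhala corpus or tagset.

```python
from collections import defaultdict

# Toy annotated corpus: (word, tag) sequences. These stand in for
# the Sinhala annotated corpus described in the abstract.
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
]

# Estimate tag-transition and word-emission counts from the corpus.
trans = defaultdict(lambda: defaultdict(int))   # P(tag_i | tag_{i-1})
emit = defaultdict(lambda: defaultdict(int))    # P(word | tag)
for sent in tagged_sentences:
    prev = "<s>"
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag

def prob(table, given, outcome):
    """Relative-frequency estimate with a tiny floor for unseen contexts."""
    total = sum(table[given].values())
    return table[given][outcome] / total if total else 1e-6

def viterbi(words, tags=("DET", "NOUN", "VERB")):
    """Most likely tag sequence under the transition/emission model."""
    # best[tag] = (score of best path ending in tag, that path)
    best = {t: (prob(trans, "<s>", t) * prob(emit, t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max(
                (best[p][0] * prob(trans, p, t) * prob(emit, t, w), best[p][1])
                for p in tags
            )
            new[t] = (score, path + [t])
        best = new
    return max(best.values())[1]

print(viterbi(["the", "dog", "sleeps"]))  # → ['DET', 'NOUN', 'VERB']
```

The decoder picks "VERB" for the unseen-in-context word "sleeps" purely from the tag-sequence probability P(VERB | NOUN), which is the point of the statistical approach.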
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE (ijnlc)
Manipuri is both a minority and a morphologically rich language with genetic features similar to the Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology and is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a research field of computer science that deals with processing large amounts of natural language corpora. NLP applications encompass e-dictionaries, morphological analyzers, Reduplicated Multi-Word Expression (RMWE) identification, Named Entity Recognition (NER), Part of Speech (POS) tagging, Machine Translation (MT), WordNet, Word Sense Disambiguation (WSD), etc. In this paper, we present a study of the advancements in NLP applications for the Manipuri language, together with a comparison table of the approaches and techniques adopted and the results obtained by each application, followed by a detailed discussion of each work.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC... (kevig)
In this paper, phoneme sequences are used as language information to perform code-switched language identification (LID). With a one-pass recognition system, the spoken sounds are converted into phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity among our target languages, we report two methods of phoneme mapping. Statistical phoneme-based bigram language models (LMs) are integrated into speech decoding to eliminate possible phone mismatches. A supervised support vector machine (SVM) is used to learn to recognize the phonetic information of mixed-language speech based on recognized phone sequences. As the back-end decision is taken by the SVM, the likelihood scores of segments with monolingual phone occurrences are used to classify language identity. The speech corpus was tested on Sepedi and English, languages that are often mixed. Our system is evaluated by measuring the ASR performance and the LID performance separately. The systems obtained a promising ASR accuracy with a data-driven phone-merging approach modelled using 16 Gaussian mixtures per state. The proposed systems achieved acceptable ASR and LID accuracy in code-switched speech and monolingual speech segments, respectively.
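The phoneme-bigram language models used to score segments for language identity can be sketched as follows. The phone strings and the add-one smoothing below are illustrative assumptions, not the systems' actual phone inventories or smoothing scheme.

```python
from collections import defaultdict
import math

def train_bigram(sequences):
    """Bigram counts over phone symbols, with sentence boundary markers."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for seq in sequences:
        for a, b in zip(["<s>"] + seq, seq + ["</s>"]):
            counts[a][b] += 1
            vocab.update([a, b])
    return counts, vocab

def log_prob(counts, vocab, seq):
    """Add-one-smoothed log probability of a phone sequence."""
    lp = 0.0
    for a, b in zip(["<s>"] + seq, seq + ["</s>"]):
        total = sum(counts[a].values())
        lp += math.log((counts[a][b] + 1) / (total + len(vocab)))
    return lp

# Hypothetical recognized phone strings for the two target languages.
english = [["h", "e", "l", "ou"], ["w", "er", "l", "d"]]
sepedi  = [["th", "o", "b", "e", "l", "a"], ["p", "u", "l", "a"]]

en_lm = train_bigram(english)
se_lm = train_bigram(sepedi)

def identify(seq):
    """Label a segment with the language whose LM scores it higher."""
    return "English" if log_prob(*en_lm, seq) > log_prob(*se_lm, seq) else "Sepedi"

print(identify(["h", "e", "l", "ou"]))  # → English
```

A real system would feed these segment-level likelihood scores to the SVM back-end rather than taking the argmax directly.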
This document provides an introduction and background on natural language processing (NLP). It discusses the key categories of linguistic knowledge needed for NLP, including phonetics, morphology, syntax, semantics, pragmatics, and discourse. It also explains that NLP tasks involve resolving ambiguity at these different levels of language. Common models and algorithms used in NLP are described, such as state machines, formal rule systems, logic, and probabilistic models. Machine learning approaches are also discussed for automatically learning NLP representations.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION (kevig)
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts containing a mix of English and Bangla words and phrases have become increasingly common. Existing transliteration tools perform poorly on such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can satisfactorily transliterate both English words and phonetically typed Bangla words. This is achieved by adopting a hybrid of dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
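The dictionary-then-rules idea behind a hybrid transliterator can be sketched in a few lines. The dictionary entries and the greedy longest-match rules below are hypothetical, not THT's actual resources or stages.

```python
# Stage 1 resource: known phonetic typings with curated Bangla spellings.
# (Hypothetical entries, for illustration only.)
DICTIONARY = {"bhalo": "ভালো", "bangla": "বাংলা"}

# Stage 2 resource: character-mapping rules, ordered longest-first so
# multi-letter units ("bh", "ng") match before single letters.
RULES = [("bh", "ভ"), ("ng", "ং"), ("aa", "া"), ("b", "ব"), ("l", "ল"),
         ("a", ""), ("o", "ো")]

def transliterate(word):
    # Stage 1: exact dictionary lookup for known words.
    if word in DICTIONARY:
        return DICTIONARY[word]
    # Stage 2: greedy rule-based fallback for out-of-dictionary words.
    out, i = [], 0
    while i < len(word):
        for src, tgt in RULES:
            if word.startswith(src, i):
                out.append(tgt)
                i += len(src)
                break
        else:
            out.append(word[i])  # pass through unmapped characters
            i += 1
    return "".join(out)
```

Out-of-dictionary words fall through to the rules; the rule stage cannot always reproduce the curated dictionary spelling ("bhalo" via rules alone comes out slightly differently), which is exactly why the hybrid outperforms rules alone.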
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ... (kevig)
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime and game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime and game characters. To overcome this challenge, we propose segmenting the lines of Japanese anime and game characters using subword units, which were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age, and individual anime character, and show that they capture linguistic speech patterns specific to each feature. Additionally, a classification experiment shows that the model with subword units outperformed the one using the conventional method.
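The idea of weighting subword units by TF/IDF to surface character-specific expressions can be sketched with character bigrams as a stand-in for learned subword units. The romanized utterances below are invented examples, not the study's data.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-grams as a simple stand-in for learned subword units."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_per_character(lines_by_character, n=2):
    """Score n-grams that are frequent for one character but rare overall."""
    docs = {c: Counter(g for line in lines for g in char_ngrams(line, n))
            for c, lines in lines_by_character.items()}
    ndocs = len(docs)
    scores = {}
    for c, tf in docs.items():
        scores[c] = {
            g: count * math.log(ndocs / sum(1 for d in docs.values() if g in d))
            for g, count in tf.items()
        }
    return scores

# Invented romanized utterances; the sentence endings differ by speaker.
lines = {
    "A": ["sou desu wa", "ikimasu wa"],
    "B": ["sou da ze", "iku ze"],
}
scores = tfidf_per_character(lines)
```

Units shared by both speakers (such as "so") score zero, while speaker A's polite-ending fragment "su" gets a positive weight, which is the pattern-extraction effect the study relies on.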
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT... (kevig)
This study proposes a method to develop neural morphological analyzers for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have delimiters between words. Hiragana is a set of Japanese phonographic characters used in texts for children or for people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there are fewer cues for segmentation. For morphological analysis of Hiragana sentences, we demonstrate the effectiveness of fine-tuning a model based on ordinary Japanese text and examine the influence of training data drawn from texts of various genres.
Spotting the Difference: Machine versus Human Translation (Ulatus)
Regardless of how much these systems have improved and made worldwide communication easier, there is still no alternative to human translation. Machines can only achieve grammatical accuracy; the semantic, linguistic, and cultural completeness of a text can only be achieved by human translators.
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA... (kevig)
A corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language that exists in machine-readable form. The scope of corpora is endless in Computational Linguistics and Natural Language Processing (NLP). A parallel corpus is a very useful resource for most NLP applications, especially for Statistical Machine Translation (SMT). SMT is currently the most popular approach to Machine Translation (MT) and can produce high-quality translations given a huge amount of aligned parallel text in both the source and target languages. Although Bodo is a recognized natural language of India and a co-official language of Assam, the amount of machine-readable Bodo text is still very small. Therefore, to expand the computerized information available in the language, an English-to-Bodo SMT system has been developed. This paper mainly focuses on building English-Bodo parallel text corpora to implement the English-to-Bodo SMT system using the phrase-based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and have constructed English-Bodo parallel text corpora for the General and Newspaper domains. Finally, the quality of the constructed parallel corpora has been tested in the SMT system using two evaluation techniques.
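The core artifact a parallel-corpus creator tool must produce is a pair of line-aligned files, one sentence per line, the plain-text format consumed by phrase-based SMT toolkits such as Moses. A minimal sketch follows; the sentence pairs and file names are placeholders, not E-BPTC's actual format or data.

```python
# Hypothetical aligned pairs; a real corpus would hold many thousands.
pairs = [
    ("Corpus construction is slow.", "<Bodo sentence 1>"),
    ("Parallel text helps SMT.", "<Bodo sentence 2>"),
]

def align(pairs):
    """Split pairs into two line lists; line i of the source list is the
    translation of line i of the target list (the alignment invariant)."""
    src_lines = [en.strip() for en, _ in pairs]
    tgt_lines = [bd.strip() for _, bd in pairs]
    assert len(src_lines) == len(tgt_lines)
    return src_lines, tgt_lines

def write_parallel(pairs, src_path, tgt_path):
    """Write Moses-style line-aligned corpus files."""
    src_lines, tgt_lines = align(pairs)
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("\n".join(src_lines) + "\n")
    with open(tgt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(tgt_lines) + "\n")
```

Keeping the two sides in separate files with matching line numbers is what lets the SMT training pipeline extract phrase pairs later.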
Summer Research Project (Anusaaraka) Report (Anwar Jameel)
This document discusses Anusaaraka, a machine translation tool being developed to translate between English and Hindi. It uses principles from Panini's grammar to map word groups and constructions between the languages. Where differences exist, extra notation is added to preserve source language information. The output is presented in layers to show the translation process. It aims to bridge the language barrier by allowing users to access text in their preferred Indian language.
Key features include faithfully representing the source text, reversibility of the translation process through layered output, and transparency by allowing users to trace the translation steps. It was developed by combining traditional Indian linguistic principles with modern technologies.
The document discusses natural language processing (NLP) for Tamil to Hindi conversion. It introduces the Universal Networking Language (UNL) as an intermediate representation to express information across languages. UNL allows text to be converted to different languages like converting a webpage to various natural languages. The document then discusses the advantages of developing machine translation between Tamil and other languages, particularly English and Hindi. It outlines the components needed for a Tamil-Hindi machine translation system, including morphological analyzers for Tamil and Hindi, a word mapping unit, and generators.
This document discusses a project to directly translate Hindi text to Tamil text without an intermediate language like English. It describes using techniques like part-of-speech tagging, statistical machine translation, word sense disambiguation using the Lesk algorithm, and morphological analysis. The goal is to build an architecture that can take Hindi input, perform the necessary NLP techniques, and output the translation in Tamil. References are provided for related work.
This document provides an introduction to machine translation and different approaches to machine translation. It discusses the history of machine translation, beginning in the 1950s. It then describes four main approaches to machine translation: direct machine translation, rule-based machine translation, corpus-based machine translation, and knowledge-based machine translation. For each approach, it provides a brief overview and example. It focuses in more depth on direct machine translation and rule-based machine translation, explaining their process and limitations.
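The direct (word-for-word) approach described above is simple enough to sketch: each source word is replaced via a bilingual lexicon, with no reordering and no agreement. The romanized Hindi-to-English lexicon below is a hypothetical fragment; the broken output order illustrates the approach's main limitation.

```python
# Hypothetical romanized Hindi → English lexicon fragment.
LEXICON = {"mai": "I", "ghar": "home", "jata": "go", "hoon": "am"}

def direct_translate(sentence):
    """Direct MT: substitute each word, keep source (SOV) word order."""
    return " ".join(LEXICON.get(w, w) for w in sentence.split())

print(direct_translate("mai ghar jata hoon"))  # → I home go am
```

The output preserves Hindi's SOV order ("I home go am" rather than "I go home"), which is precisely the limitation that rule-based and corpus-based approaches address with reordering models.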
Script to Sentiment: On the Future of Language Technology, Mysore (Jaganadh Gopinadhan)
The document discusses developments in the field of human language technology (HLT) and its future applications. It notes that HLT is no longer confined to academia and is becoming integrated into information and communication technology products and services in daily life. The document provides an overview of developments in text and speech processing, including machine translation systems and spell checkers. It also discusses the role of open source tools and frameworks in advancing research in HLT, particularly for Indian languages.
The document discusses natural language and natural language processing (NLP). It defines natural language as languages used for everyday communication like English, Japanese, and Swahili. NLP is concerned with enabling computers to understand and interpret natural languages. The summary explains that NLP involves morphological, syntactic, semantic, and pragmatic analysis of text to extract meaning and understand context. The goal of NLP is to allow humans to communicate with computers using their own language.
This document provides an overview of natural language processing (NLP). It discusses how NLP analyzes human language input to build computational models of language. The key components of NLP are natural language understanding and natural language generation. Challenges in NLP include ambiguity, context dependence, and the creative nature of language. The document also outlines common NLP techniques like keyword analysis and syntactic parsing, as well as formal grammars and parsing approaches.
Natural language processing provides a way for humans to interact with computers and machines by means of voice. Google Search by Voice, which makes use of natural language processing, is a well-known example.
The document provides an overview of natural language processing (NLP). It defines NLP as the automatic processing of human language and discusses how NLP relates to fields like linguistics, cognitive science, and computer science. The document also describes common NLP tasks like information extraction, machine translation, and summarization. It discusses challenges in NLP like ambiguity and examines techniques used in NLP like rule-based systems, probabilistic models, and the use of linguistic knowledge.
The document discusses natural language processing and some of the key challenges involved. It describes how NLP systems aim to understand human language in written or spoken form by performing tasks like morphological analysis, parsing, semantic analysis, and discourse processing. It also discusses sources of ambiguity in natural language and different models and algorithms used to represent linguistic knowledge and process language, with the goal of building intelligent systems that can understand human communication.
Natural Language Processing (NLP) is a field of computer science concerned with interactions between computers and human languages. NLP involves understanding written or spoken language at various levels such as morphology, syntax, semantics, and pragmatics. The goal of NLP is to allow computers to understand, generate, and translate between different human languages.
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES (csandit)
This document summarizes and reviews various grammar checkers for natural languages. It begins by defining key concepts in natural language processing like computational linguistics and grammar checking. It then describes the general working of grammar checkers, which involves preprocessing text, analyzing morphology and syntax, and identifying grammatical errors. The document surveys grammar checking approaches for several languages like rule-based, statistical, and hybrid methods. Specific grammar checkers are discussed for languages like Afan Oromo, Amharic, Swedish, Icelandic, Nepali, and Portuguese. The review concludes by analyzing the features and limitations of existing grammar checking systems.
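A rule-based checker, the simplest of the approaches surveyed, can be sketched as a list of error patterns applied to preprocessed text. The two English rules below are illustrative assumptions; real systems apply hundreds of rules after morphological and syntactic analysis.

```python
import re

# Two hypothetical error-pattern rules.
RULES = [
    # "a" before a word starting with a vowel letter (crude vowel-sound test).
    (re.compile(r"\b(a)\s+([aeiou]\w*)", re.I), "use 'an' before a vowel sound"),
    # The same word repeated twice in a row.
    (re.compile(r"\b(\w+)\s+\1\b", re.I), "repeated word"),
]

def check(text):
    """Return (matched text, message) for every rule violation found."""
    errors = []
    for pattern, message in RULES:
        for m in pattern.finditer(text):
            errors.append((m.group(0), message))
    return errors

print(check("She ate a apple and and left."))
```

This flags "a apple" and "and and" and passes a clean sentence untouched; statistical and hybrid checkers replace or rerank such hand-written patterns with corpus-derived evidence.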
Transliteration by Orthography or Phonology for Hindi and Marathi to English ... (ijnlc)
E-governance and web-based online commercial multilingual applications have given utmost importance to the tasks of translation and transliteration. The named entities and technical terms that occur in the source language of translation are called out-of-vocabulary words, as they are not available in the multilingual corpus or dictionary used to support the translation process. These named entities and technical terms need to be transliterated from the source language to the target language without losing their phonetic properties. A fundamental problem in India is that there is no set of rules for writing the English spellings of Indian-language words according to their linguistics. People write different spellings for the same name in different places. This fact certainly affects the Top-1 accuracy of transliteration and, in turn, of the translation process. The major issue we noticed is the transliteration of named entities consisting of three syllables or three phonetic units in the Hindi and Marathi languages, where people take a mixed approach to writing the spelling, either orthographically or phonologically. In this paper, the authors provide, through experimentation, their opinion about the appropriateness of either approach.
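The two spelling strategies can be illustrated with a toy function over a three-syllable name: the orthographic spelling keeps every inherent 'a', while the phonological spelling deletes the medial schwa that Hindi and Marathi speakers typically do not pronounce. The syllable romanizations below are hypothetical, not the paper's data.

```python
def spellings(syllables):
    """Return (orthographic, phonological) Latin spellings of a
    three-syllable name given per-syllable romanizations."""
    # Orthographic: transliterate every syllable as written.
    ortho = "".join(syllables).capitalize()
    # Phonological: drop the inherent 'a' (schwa) of the middle syllable,
    # reflecting how the name is actually pronounced.
    phono = "".join(
        [syllables[0], syllables[1].rstrip("a"), syllables[2]]
    ).capitalize()
    return ortho, phono

print(spellings(["sa", "cha", "na"]))  # → ('Sachana', 'Sachna')
```

Both spellings occur in the wild for the same name, which is exactly the Top-1 accuracy problem the paper studies.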
This document discusses different types of language input that can contribute to second language acquisition (SLA). It examines pre-modified input, interactionally modified input, and modified output as three potential sources of comprehensible input according to some researchers. Pre-modified input involves modifying language before learners see or hear it. Interactionally modified input refers to modifications made during interaction with native or proficient non-native speakers. Modified output occurs when learners modify their language in response to input through interaction. The document also discusses other potential types of input, such as incomprehensible input and comprehensible output, that may enhance SLA. It concludes that while language input is important for SLA, different types of input beyond just comprehensible input can support acquisition.
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ... (Syeful Islam)
Hundreds of millions of people, of almost all levels of education and backgrounds, communicate with each other across countries using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems of Bengali translation is identifying a naming word (proper noun) in a sentence. This is relatively simple in English, because such entities start with a capital letter; Bangla, however, has no concept of small or capital letters, and a huge number of different naming entities exists in Bangla. Thus it is difficult to determine whether a word is a proper noun or not. Here we introduce a new approach to identify proper nouns in a Bengali sentence for UNL without storing a huge number of naming entities in the word dictionary. The goal is to make conversion from Bangla sentences to UNL, and vice versa, possible with a minimal number of stored words in the dictionary.
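The dictionary-free idea, treating any word absent from the regular word dictionary as a candidate proper noun, can be sketched as follows. The romanized Bangla fragment is invented for illustration and is not the paper's actual lexicon or algorithm.

```python
# Hypothetical romanized fragment of a Bangla word dictionary built
# for UNL conversion; names are deliberately NOT stored in it.
WORD_DICTIONARY = {"ami", "bhat", "khai", "school", "jai"}

def find_proper_nouns(sentence):
    """Words absent from the dictionary are treated as candidate proper
    nouns, so the dictionary never has to enumerate every name."""
    return [w for w in sentence.split() if w not in WORD_DICTIONARY]

print(find_proper_nouns("rahim bhat khai"))  # → ['rahim']
```

This inverts the English capitalization cue: instead of marking names in the text, the system marks everything it already knows and lets names fall out as the remainder.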
Lecture 1: Semantic Analysis in Language Technology (Marina Santini)
This document provides an introduction to a course on semantic analysis in language technology taught at Uppsala University in Sweden. It outlines the course website, contact information for the instructor, intended learning outcomes, required readings, assignments and examination. The course focuses on applying semantic analysis methods in natural language processing tasks like sentiment analysis, information extraction, word sense disambiguation and predicate-argument extraction. It will introduce students to representing and modeling meaning in language through formal logics and semantic frameworks.
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...kevig
Corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language which exists in machine readable form. The scope of the corpus is endless in Computational Linguistics and Natural Language Processing (NLP). Parallel corpus is a very useful resource for most of the applications of NLP, especially for Statistical Machine Translation (SMT). The SMT is the most popular approach of Machine Translation (MT) nowadays and it can produce high quality translation result based on huge amount of aligned parallel text corpora in both the source and target languages. Although Bodo is a recognized natural language of India and co-official languages of Assam, still the machine readable information of Bodo language is very low. Therefore, to expand the computerized information of the language, English to Bodo SMT system has been developed. But this paper mainly focuses on building English-Bodo parallel text corpora to implement the English to Bodo SMT system using Phrase-Based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and have been constructed General and Newspaper domains English-Bodo parallel text corpora. Finally, the quality of the constructed parallel text corpora has been tested using two evaluation techniques in the SMT system.
Summer Research Project (Anusaaraka) ReportAnwar Jameel
This document discusses Anusaaraka, a machine translation tool being developed to translate between English and Hindi. It uses principles from Panini's grammar to map word groups and constructions between the languages. Where differences exist, extra notation is added to preserve source language information. The output is presented in layers to show the translation process. It aims to bridge the language barrier by allowing users to access text in their preferred Indian language.
Key features include faithfully representing the source text, reversibility of the translation process through layered output, and transparency by allowing users to trace the translation steps. It was developed by combining traditional Indian linguistic principles with modern technologies.
The document discusses natural language processing (NLP) for Tamil to Hindi conversion. It introduces the Universal Networking Language (UNL) as an intermediate representation to express information across languages. UNL allows text to be converted into different languages, for example converting a webpage into various natural languages. The document then discusses the advantages of developing machine translation between Tamil and other languages, particularly English and Hindi. It outlines the components needed for a Tamil-Hindi machine translation system, including morphological analyzers for Tamil and Hindi, a word mapping unit, and generators.
This document discusses a project to directly translate Hindi text to Tamil text without an intermediate language like English. It describes using techniques like part-of-speech tagging, statistical machine translation, word sense disambiguation using the Lesk algorithm, and morphological analysis. The goal is to build an architecture that can take Hindi input, perform the necessary NLP techniques, and output the translation in Tamil. References are provided for related work.
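The Lesk algorithm mentioned above can be sketched in its simplified form: choose the sense whose dictionary gloss shares the most words with the surrounding context. The senses and glosses below are toy examples, not from any particular Hindi or Tamil resource.

```python
def simplified_lesk(word, context, senses):
    """Pick the sense whose gloss shares the most words with the context.

    `senses` maps a sense label to its gloss (a definition string).
    This is the simplified Lesk variant; the full algorithm also
    compares glosses of surrounding words.
    """
    context_words = set(context.lower().split())
    best, best_overlap = None, -1
    for label, gloss in senses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best

senses = {
    "bank_river": "sloping land beside a body of water",
    "bank_finance": "institution that accepts deposits and lends money",
}
print(simplified_lesk("bank", "he sat on the land beside the water", senses))
# -> bank_river (three gloss words overlap the context; none do for finance)
```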
This document provides an introduction to machine translation and different approaches to machine translation. It discusses the history of machine translation, beginning in the 1950s. It then describes four main approaches to machine translation: direct machine translation, rule-based machine translation, corpus-based machine translation, and knowledge-based machine translation. For each approach, it provides a brief overview and example. It focuses in more depth on direct machine translation and rule-based machine translation, explaining their process and limitations.
Script to Sentiment: on the future of Language Technology (Mysore, latest)Jaganadh Gopinadhan
The document discusses developments in the field of human language technology (HLT) and its future applications. It notes that HLT is no longer confined to academia and is becoming integrated into information and communication technology products and services in daily life. The document provides an overview of developments in text and speech processing, including machine translation systems and spell checkers. It also discusses the role of open source tools and frameworks in advancing research in HLT, particularly for Indian languages.
The document discusses natural language and natural language processing (NLP). It defines natural language as languages used for everyday communication like English, Japanese, and Swahili. NLP is concerned with enabling computers to understand and interpret natural languages. The summary explains that NLP involves morphological, syntactic, semantic, and pragmatic analysis of text to extract meaning and understand context. The goal of NLP is to allow humans to communicate with computers using their own language.
This document provides an overview of natural language processing (NLP). It discusses how NLP analyzes human language input to build computational models of language. The key components of NLP are natural language understanding and natural language generation. Challenges in NLP include ambiguity, context dependence, and the creative nature of language. The document also outlines common NLP techniques like keyword analysis and syntactic parsing, as well as formal grammars and parsing approaches.
Natural language processing provides a way for humans to interact with computers and machines by means of voice.
Google Search by Voice, which makes use of natural language processing, is the best-known example.
The document provides an overview of natural language processing (NLP). It defines NLP as the automatic processing of human language and discusses how NLP relates to fields like linguistics, cognitive science, and computer science. The document also describes common NLP tasks like information extraction, machine translation, and summarization. It discusses challenges in NLP like ambiguity and examines techniques used in NLP like rule-based systems, probabilistic models, and the use of linguistic knowledge.
The document discusses natural language processing and some of the key challenges involved. It describes how NLP systems aim to understand human language in written or spoken form by performing tasks like morphological analysis, parsing, semantic analysis, and discourse processing. It also discusses sources of ambiguity in natural language and different models and algorithms used to represent linguistic knowledge and process language, with the goal of building intelligent systems that can understand human communication.
Natural Language Processing (NLP) is a field of computer science concerned with interactions between computers and human languages. NLP involves understanding written or spoken language at various levels such as morphology, syntax, semantics, and pragmatics. The goal of NLP is to allow computers to understand, generate, and translate between different human languages.
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGEScsandit
This document summarizes and reviews various grammar checkers for natural languages. It begins by defining key concepts in natural language processing like computational linguistics and grammar checking. It then describes the general working of grammar checkers, which involves preprocessing text, analyzing morphology and syntax, and identifying grammatical errors. The document surveys grammar checking approaches for several languages like rule-based, statistical, and hybrid methods. Specific grammar checkers are discussed for languages like Afan Oromo, Amharic, Swedish, Icelandic, Nepali, and Portuguese. The review concludes by analyzing the features and limitations of existing grammar checking systems.
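A minimal illustration of the rule-based approach surveyed above: each rule pairs a pattern over the text with a diagnostic message. The two rules shown are toy English examples, not drawn from any of the systems reviewed.

```python
import re

# Each rule is (compiled pattern, diagnostic message). Real rule-based
# checkers operate on morphologically analyzed, POS-tagged text rather
# than raw strings; regexes keep this sketch self-contained.
RULES = [
    (re.compile(r"\ba\s+[aeiouAEIOU]\w*"), 'use "an" before a vowel sound'),
    (re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE), "repeated word"),
]

def check(text):
    """Return (matched text, message) for every rule violation found."""
    return [(m.group(0), msg) for rx, msg in RULES for m in rx.finditer(text)]

print(check("She ate a apple and and left."))
# -> [('a apple', 'use "an" before a vowel sound'), ('and and', 'repeated word')]
```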
Transliteration by orthography or phonology for hindi and marathi to english ...ijnlc
e-Governance and web-based online commercial multilingual applications have given utmost importance to the tasks of translation and transliteration. The named entities and technical terms that occur in the source language of translation are called out-of-vocabulary words, as they are not available in the multilingual corpus or dictionary used to support the translation process. These named entities and technical terms need to be transliterated from the source language to the target language without losing their phonetic properties. The fundamental problem in India is that there is no set of rules for writing the English spellings of Indian-language words according to their linguistics: people write different spellings for the same name in different places. This fact certainly affects the Top-1 accuracy of transliteration and, in turn, of the translation process. The major issue we noticed is the transliteration of named entities consisting of three syllables or three phonetic units in Hindi and Marathi, where people use a mixed approach to spelling, either orthographical or phonological. In this paper the authors provide, through experimentation, their opinion about the appropriateness of either approach.
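The orthographic/phonological contrast can be illustrated with a toy romanizer: the orthographic spelling keeps every inherent 'a', while the phonological spelling applies final schwa deletion, as the name is actually pronounced. The consonant table covers only a few Devanagari letters and is purely illustrative; real transliteration needs full character coverage, vowel signs, and a proper schwa-deletion rule.

```python
# Minimal Devanagari consonant map (illustrative subset only).
CONSONANTS = {"क": "k", "म": "m", "ल": "l", "न": "n", "र": "r", "स": "s"}

def orthographic(word):
    """Letter-by-letter spelling: every consonant keeps its inherent 'a'."""
    return "".join(CONSONANTS[ch] + "a" for ch in word)

def phonological(word):
    """Drop the word-final inherent 'a' (final schwa deletion)."""
    return orthographic(word)[:-1] if word else ""

name = "कमल"  # the common name "Kamal"
print(orthographic(name), phonological(name))  # kamala kamal
```

The two outputs are exactly the competing spellings the abstract describes for three-syllable names.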
This document discusses different types of language input that can contribute to second language acquisition (SLA). It examines pre-modified input, interactionally modified input, and modified output as three potential sources of comprehensible input according to some researchers. Pre-modified input involves modifying language before learners see or hear it. Interactionally modified input refers to modifications made during interaction with native or proficient non-native speakers. Modified output occurs when learners modify their language in response to input through interaction. The document also discusses other potential types of input like incomprehensible input and comprehensible output that may enhance SLA. It concludes that while language input is important for SLA, different types of input beyond just comprehensible input can support
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...Syeful Islam
More than hundreds of millions of people, of almost all levels of education and attitudes, from different countries communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems of Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and there is a huge number of different naming entities in Bangla, so it is difficult to determine whether a word is a proper noun or not. Here we introduce a new approach to identify proper nouns in a Bengali sentence for UNL without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion to UNL and vice versa possible with minimal words stored in the dictionary.
Lecture 1: Semantic Analysis in Language TechnologyMarina Santini
This document provides an introduction to a course on semantic analysis in language technology taught at Uppsala University in Sweden. It outlines the course website, contact information for the instructor, intended learning outcomes, required readings, assignments and examination. The course focuses on applying semantic analysis methods in natural language processing tasks like sentiment analysis, information extraction, word sense disambiguation and predicate-argument extraction. It will introduce students to representing and modeling meaning in language through formal logics and semantic frameworks.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also include different speech orders so that the recognizer can build models for all the sounds present. But for Telugu, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases comprising several words (that is, tokens) in English would be mapped onto a single word in Telugu. Telugu is phonetic in nature in addition to being rich in morphology. That is why the speech technology developed for English cannot be applied directly to Telugu. This paper highlights the work carried out in an attempt to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method increases speech recognition accuracy with a large reduction in corpus size. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
Design and Development of Morphological Analyzer for Tigrigna Verbs using Hyb...kevig
A morphological analyzer is the basis for various high-level NLP applications such as information retrieval, spell checking, grammar checking, machine translation, speech recognition, POS tagging and automatic sentence construction. This paper presents the design and analysis of a morphological analyzer for Tigrigna verbs using a hybrid of memory-based learning and rule-based approaches. The experiments were conducted using Python 3, with the TiMBL algorithms IB2 and TRIBL2 and Finite State Transducer rules. The performance of the system has been evaluated using the 10-fold cross-validation technique. Testing was conducted using optimized parameter settings for regular verbs, and linguistic rules for Tigrigna allomorphy and phonology for the irregular verbs. The accuracy of the memory-based approach with the optimized parameters of the TiMBL algorithms IB2 and TRIBL2 was 93.24% and 92.31%, respectively. Finally, the hybrid approach reached an actual performance of 95.6%, using linguistic rules to handle irregular and copula verbs.
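TiMBL's IB-family learners are memory-based (instance-based) classifiers: training instances are stored verbatim and new ones are labeled by their nearest neighbours. A bare-bones sketch of the idea follows, with hypothetical feature tuples standing in for the paper's Tigrigna verb features; TiMBL adds feature weighting, richer metrics, and the IB2 editing step that stores only misclassified instances.

```python
from collections import Counter

# Stored training instances: (feature tuple, class label). The features
# are invented placeholders, not actual Tigrigna morphological features.
memory = [
    (("a", "b"), "class1"),
    (("a", "c"), "class1"),
    (("x", "y"), "class2"),
]

def overlap_distance(f1, f2):
    """Count of positions where the two feature tuples disagree."""
    return sum(a != b for a, b in zip(f1, f2))

def classify(features, k=1):
    """Label by majority vote among the k nearest stored instances."""
    ranked = sorted(memory, key=lambda inst: overlap_distance(features, inst[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(classify(("a", "z")))  # nearest stored instances share feature "a"
```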
Substitution Error Analysis for Improving the Word Accuracy in Telugu Langua...IOSR Journals
This document discusses substitution error analysis to improve word accuracy in an automatic speech recognition system for the Telugu language. It analyzes the performance of an ASR system using two different lexical models - one based on a stress-timed language (CMU) and the other a handcrafted lexicon for syllable-timed Telugu. The effect of gender, accents, and pronunciation variants on substitution errors is studied. Confusion matrices of vowels and consonants show the most common phoneme substitutions for each case. The Telugu-based lexicon improves word accuracy by 20-30% over the CMU-based system.
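The substitution-error tallying described here can be sketched as a confusion count over aligned (reference, hypothesis) phoneme pairs; the pairs below are invented examples, not the paper's Telugu data.

```python
from collections import Counter

# Aligned (reference phoneme, hypothesis phoneme) pairs, as produced by
# aligning ASR output against a reference transcription.
alignments = [
    ("p", "b"), ("p", "p"), ("t", "d"), ("t", "t"), ("p", "b"),
]

# The Counter over pairs is a sparse confusion matrix: diagonal entries
# are correct recognitions, off-diagonal entries are substitutions.
confusion = Counter(alignments)

def top_substitutions(n=2):
    """Most frequent ref->hyp substitutions (off-diagonal cells)."""
    subs = {pair: c for pair, c in confusion.items() if pair[0] != pair[1]}
    return sorted(subs.items(), key=lambda kv: -kv[1])[:n]

print(top_substitutions())  # [(('p', 'b'), 2), (('t', 'd'), 1)]
```

Inspecting the top substitutions is exactly how the lexicon comparison above identifies which phoneme confusions a better pronunciation model should target.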
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
More than hundreds of millions of people, of almost all levels of education and attitudes, from different countries communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems of Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and there is a huge number of different naming entities in Bangla, so it is difficult to determine whether a word is a naming word or not. Here we introduce a new approach to identify naming words in a Bengali sentence for a machine translation system without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion possible with minimal words stored in the dictionary.
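The dictionary-lookup core of the idea can be sketched as follows: a token absent from the (deliberately small) known-word dictionary becomes a naming-word candidate, so no exhaustive name list is ever stored. Tokens are romanized Bengali stand-ins and the word list is hypothetical; the actual approach layers morphological and contextual rules on top of this lookup.

```python
# Hypothetical known-vocabulary dictionary of common (non-name) words,
# shown in romanized form for readability.
KNOWN_WORDS = {"ami", "tumi", "boi", "pori", "bhalo"}

def candidate_naming_words(sentence):
    """Flag tokens absent from the dictionary as naming-word candidates."""
    return [w for w in sentence.split() if w not in KNOWN_WORDS]

# "ami rahim boi pori" ~ "I, Rahim, read a book": only the name is flagged.
print(candidate_naming_words("ami rahim boi pori"))  # ['rahim']
```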
We start with a linguistic discussion of language, its properties, and the study of language in philosophy and linguistics. We then investigate natural languages, controlled languages, and artificial languages to emphasise the human ability to control and construct languages. At the end, we arrive at the notion of software languages as means to communicate software between people.
Natural Language Processing and Language Learningantonellarose
NLP can be used in two main ways for language learning: 1) To analyze learner language produced in response to exercises to provide feedback on errors, and 2) To analyze native language materials to provide learners with relevant examples and generate exercises. For the first use, NLP is needed to automatically analyze learner responses when there are too many possible responses to specify feedback for each one. NLP identifies common properties and errors rather than analyzing specific strings. For the second use, NLP supports finding and presenting native language materials and generating exercises based on them.
Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Univ...Syeful Islam
Abstract—Nowadays hundreds of millions of people, of almost all levels of education and attitudes, from different countries communicate with each other for different purposes and perform their jobs on the internet or other communication media using various languages. Not all people know all languages, so it is very difficult to communicate or work across various languages. In this situation computer scientists have introduced various interlanguage translation programs (machine translation). UNL is one such interlanguage translation program. One of the major problems of UNL is identifying a name in a sentence, which is relatively simple in English because such entities start with a capital letter. In Bangla we have no concept of small or capital letters, so it is difficult to determine whether a word is a proper noun or not. Here we propose analysis rules to identify proper nouns in a sentence and establish a post-converter that translates the named entity from Bangla to UNL. The goal is to make Bangla sentence conversion to UNL and vice versa possible. The UNL system proves that the theoretical analysis of our proposed system is able to identify proper nouns from a Bangla sentence and produce the relative universal word for UNL.
Natural Language Processing: State of The Art, Current Trends and Challengesantonellarose
Diksha Khurana1, Aditya Koli1, Kiran Khatter1,2 and Sukhdev Singh1,2
1Department of Computer Science and Engineering, Manav Rachna International University, Faridabad-121004, India
2Accendere Knowledge Management Services Pvt. Ltd., India
Linguistics is the scientific study of language and its properties. It seeks to answer fundamental questions about the nature of language, including what language is, how it is used, and its core components. The main areas of linguistics include phonetics, phonology, morphology, syntax, semantics, and pragmatics. Linguistics helps us better understand language acquisition, speech therapy, computational linguistics, language teaching, and the analysis of literature. Studying linguistics provides insights into human communication abilities and how they can be applied.
Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...Waqas Tariq
The Universal Networking Language (UNL) deals with communication across nations of different languages and involves many related disciplines such as linguistics, epistemology and computer science. It helps to overcome the language barrier among people of different nations, to solve problems emerging from current globalization trends and geopolitical interdependence. We are working to include the Bangla language in the UNL system so that Bangla can be converted to UNL expressions. As part of this process we are currently working on Bangla consonant-ended verb roots and developing lexical (dictionary) entries for them. In this paper, we present our work by describing the Bangla verb, verb root and verbal inflections, and finally show the dictionary entries for the consonant-ended roots.
The Ekegusii Determiner Phrase Analysis in the Minimalist ProgramBasweti Nobert
Among some of the recent syntactic developments, the noun phrase has been reanalyzed as a determiner phrase (DP). This study analyses the Ekegusii determiner phrase (DP) with an inquiry into the relationship between agreement of the INFL (sentence) and concord in the noun phrase (determiner phrase). It hypothesizes that Ekegusii sentential agreement has a symmetrical relationship with Ekegusii determiner phrase internal concord, and that feature checking theory and full interpretation (FI) in the Minimalist Program are adequate for the analysis of the internal structure of the Ekegusii DP. In employing the Minimalist Program (MP), the study first seeks to establish the domain of the NP in the Ekegusii DP and then investigates the adequacy of the Minimalist Program in analyzing the Ekegusii DP. The study is also geared towards establishing the order of determiners in the DP between the D-head and the NP complement. It concludes that the principles of feature checking and full interpretation in the Minimalist Program are mutually crucial in ensuring that Ekegusii constructions (the DP and even the sentence) are grammatical (converge). This emphasizes the fact that the MP is adequate for Ekegusii DP analysis.
Smart grammar a dynamic spoken language understanding grammar for inflective ...ijnlc
1. The document proposes SmartGrammar, a new method for developing spoken language understanding grammars for inflectional languages like Italian.
2. SmartGrammar uses a morphological analyzer to convert user utterances into their canonical forms before parsing, allowing the grammar to contain only canonical word forms rather than all possible inflections.
3. This significantly reduces the complexity and size of grammars for inflectional languages by representing many possible inflected forms with a single canonical form entry, making grammar development and management easier.
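The canonical-form idea can be sketched with a table-driven stand-in for the morphological analyzer: inflected forms are mapped to lemmas before grammar matching, so the grammar itself lists canonical entries only. The Italian forms below are illustrative; SmartGrammar uses a full morphological analyzer rather than a lookup table.

```python
# Toy morphological table: inflected Italian verb forms -> canonical form.
# A real analyzer computes this mapping instead of enumerating it.
LEMMAS = {"vado": "andare", "vai": "andare", "andiamo": "andare",
          "mangio": "mangiare", "mangi": "mangiare"}

# The grammar stores canonical forms only, instead of every inflection.
GRAMMAR_VERBS = {"andare", "mangiare"}

def matches_grammar(utterance):
    """Reduce each token to its canonical form, then match the grammar."""
    canonical = [LEMMAS.get(w, w) for w in utterance.split()]
    return any(w in GRAMMAR_VERBS for w in canonical)

print(matches_grammar("vado a roma"))  # True: "vado" reduces to "andare"
```

One grammar entry per lemma thus covers every inflected form the analyzer can reduce, which is exactly the size reduction the summary describes.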
Syracuse University SURFACE The School of Information Studie.docxdeanmtaylor1545
Syracuse University
SURFACE
The School of Information Studies Faculty Scholarship
School of Information Studies (iSchool)
2001
Natural Language Processing
Elizabeth D. Liddy
Syracuse University, [email protected]
Recommended Citation
Liddy, E. D. 2001. Natural Language Processing. In Encyclopedia of Library and Information Science, 2nd Ed. NY: Marcel Dekker, Inc.
Natural Language Processing
1
INTRODUCTION
Natural Language Processing (NLP) is the computerized approach to analyzing text that
is based on both a set of theories and a set of technologies. And, being a very active area
of research and development, there is not a single agreed-upon definition that would
satisfy everyone, but there are some aspects, which would be part of any knowledgeable
person’s definition. The definition I offer is:
Definition: Natural Language Processing is a theoretically motivated range of
computational techniques for analyzing and representing naturally occurring texts
at one or more levels of linguistic analysis for the purpose of achieving human-like
language processing for a range of tasks or applications.
Several elements of this definition can be further detailed. Firstly the imprecise notion of
‘range of computational techniques’ is necessary because there are multiple methods or
techniques from which to choose to accomplish a particular type of language analysis.
‘Naturally occurring texts’ can be of any language, mode, genre, etc. The texts can be
oral or written. The only requirement is that they be in a language used by humans to
communicate to one another. Also, the text being analyzed should not be specifically
constru.
K AMBA P ART O F S PEECH T AGGER U SING M EMORY B ASED A PPROACHijnlc
Part of speech tagging is very important and the in
itial work towards machine translation and text
manipulation. Though much has been done in this reg
ard to the Indo- European and Asiatic languages,
development of part of speech tagging tools for Afr
ican languages is wanting. As a result, these lang
uages
are classified as under resourced languages.
This paper presents data driven part of speech tagg
ing tools for kikamba which is an under resourced
language spoken mostly in Machakos, Makueni and Kit
ui. The tool is made using the lazy learner called
Memory Based Tagger (MBT) with approximately thirty
thousand word corpuses. The corpus is collected,
cleaned and formatted with regard to MBT and experi
ment run.
Very encouraging performance is reported despite li
ttle amount of corpus, which clearly shows that us
ing
the state of art technology of data driven methods
tools can be developed for under resourced language
s.
We report a precision of 83%, recall of 72% and F-s
core of 75% and in terms of accuracy for the known
and unknown words, and accuracy of 94.65% and71.93%
respectively with overall accuracy of
90.68%..This predicts that with little source of co
rpus using data driven approach, we can generate to
ols
for the under resourced languages in Kenya.
Natural language processing with python and amharic syntax parse tree by dani...Daniel Adenew
Natural Language Processing is an interrelated disincline adding the capability of communicating as human beings to Computerworld. Amharic language is having much improvement over time thanks to researcher at PHD, MSC level at AAU. Here , I have tried to study and come up a limited scope solution that does syntax parsing for Amharic language and draws syntax parse trees using Python!!
Role of language engineering to preserve endangered languagesDr. Amit Kumar Jha
Role of Language Engineering to Preserve Endangered Languages discusses how language engineering can help preserve endangered languages through documentation and digitization. Language engineering is the application of computer science to develop language-related software and hardware. It involves techniques like speech and text processing to develop systems that can understand, interpret, and generate human language. Documenting endangered languages through recording speech samples and collecting texts is important for preservation. Language engineering makes this documentation process easier through tools like speech-to-text, text-to-speech, and transcription tools. It also allows for digital storage of language data, which helps preserve languages for longer as digital data is more durable than other forms of storage. Developing applications that use endangered languages, like translation systems,
The Input Learner Learners Forward Throughout...Tiffany Sandoval
This document provides an analysis of Robert Frost's poem "Stopping by Woods on a Snowy Evening" through a linguistic and stylistic lens. It introduces stylistics as the study of appropriate language use and style in writing. The analysis will examine Frost's style and how it shapes the interpretation of the poem. It describes Frost as an American poet known for his philosophical poetry dealing with existential questions about life, death, and humanity's place in the universe. The analysis will observe Frost's style in this particular poem.
International Journal on Natural Language Computing (IJNLC) Vol. 3, No. 3, June 2014
DOI: 10.5121/ijnlc.2014.3302
HIDDEN MARKOV MODEL BASED PART OF SPEECH
TAGGER FOR SINHALA LANGUAGE
A.J.P.M.P. Jayaweera¹ and N.G.J. Dias²
¹ Virtusa (Pvt.) Ltd, No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka
² Department of Statistics & Computer Science, University of Kelaniya, Kelaniya, Sri Lanka
ABSTRACT
In this paper we present the fundamental lexical semantics of the Sinhala language and a Hidden Markov Model (HMM) based Part of Speech (POS) tagger for Sinhala. In any natural language processing task, part of speech is a vital topic: it involves analysing the construction, behaviour and dynamics of a language, knowledge that can be utilized in computational linguistic analysis and automation applications. Since Sinhala is a morphologically rich and agglutinative language, in which words are inflected with various grammatical features, tagging is essential for further analysis of the language. Our research is based on a statistical approach, in which tagging is done by computing the tag sequence probability and the word-likelihood probability from the given corpus; the linguistic knowledge is automatically extracted from the annotated corpus. The current tagger reaches more than 90% accuracy for known words.
KEYWORDS
Part of Speech tagging, Morphology, Natural Language Processing, Hidden Markov Model, Stochastic
based tagging
1. INTRODUCTION
According to figures from UNESCO (the United Nations Educational, Scientific and Cultural Organization), around 6900 spoken languages exist in the world, yet only 20 languages are spoken by 50% of the world's population, each of them by more than 50 million speakers. The most widely spoken language is Mandarin Chinese, with around 1000 million speakers. Spanish, English, Hindi, Arabic, Portuguese and Russian are the other most spoken languages, each with 200 million speakers or more. The people who speak these top languages are spread across different geographical regions in multiple countries. At the same time, 50% of the world's languages are endangered; most of them are spoken by small communities and are limited to a specific geographical region [1, 2, 3].
Sinhala is a unique language, spoken only by people in Sri Lanka; more than 17 million speakers use Sinhala as their mother tongue. We believe that Sinhala is not yet an endangered language, though its speakers are limited to a small geographical region. Still, we think our mother language needs more attention and more provision to develop it with the latest technology trends. Our effort here is to address one gap that we have identified in the area of computational linguistics and Natural Language Processing (NLP) for the Sinhala language.
Though research on NLP has taken a giant leap in the last two decades with the advent of efficient machine learning algorithms and the creation of large annotated corpora for various languages, only a few languages in the world, such as English, have the advantage of rich lexical resources. NLP research on Sinhala is still far behind that on other South Asian languages, and very limited lexical resources are available for the Sinhala language. NLP research on Sinhala can be advanced by creating the required lexical resources and tools. So, the aim of this research is to develop a part of speech tagger for the Sinhala language, a fundamental need for further computational linguistic analysis of our mother language.
Sinhala is a complex language, morphologically rich and agglutinative in nature, whose words are inflected with various grammatical features. A Sinhala root noun (lemma) inflects for singular and plural, and a Sinhala verb marks almost everything, such as gender, number and person, and represents the tense of the activity.
POS tagging is a well-studied problem in the field of NLP and one of the fundamental processing steps for any language in NLP and language automation, i.e., the capability of a computer to automatically POS tag a given sentence. Throughout the history of NLP, different approaches have been tried to automate the task of POS tagging for languages such as English, German and Chinese, and for a few South Asian languages such as Hindi, Tamil and Bengali.
Words are the fundamental building blocks of a language. Every human language, spoken, signed or written, is composed of words [7]. Every area of speech and language processing, from speech recognition to machine translation, text to speech, spelling and grammar checking, and language-based information retrieval on the Web, requires extensive knowledge about words, which is heavily based on lexical knowledge. In contrast to other data processing systems, language processing applications use knowledge of the language.
The basic processing step in tagging consists of assigning to every token in the text a corresponding POS tag, such as noun, verb or preposition, based both on its definition and on its context. The number of part of speech tags in a tagger may vary depending on the information one wants to capture [7].
In this paper, we present a fundamental lexical and morphological analysis of the Sinhala language, the theory of Hidden Markov Models and the algorithm of our implementation. Section 2 of this paper gives an overview of the history of and previous research on NLP, and Section 3 discusses previous work on the Sinhala language. Sections 4 and 5 give a comprehensive lexical and morphological analysis of the Sinhala language. Sections 6 and 7 give details about the available lexical resources that we use in this research. Sections 8 and 9 describe POS tagging and the Hidden Markov Model implementation algorithm. Sections 10 and 11 discuss the evaluation, testing and results, and Section 12 concludes the paper and describes future work.
2. PREVIOUS WORK ON NLP
The history of natural language processing starts with Shannon (1948), Kleene (1951), Chomsky (1956) and Harris (1959), who contributed much in the early years to formulating the basic concepts and principles of language processing. In the last 50 years of research in language processing, various kinds of knowledge have been captured through the use of a small number of formal models or theories, mostly drawn from the standard toolkits of computer science, mathematics and linguistics. Among the most important elements in these toolkits are state machines, formal rule systems and logic, as well as probability theory and other machine learning tools [7]. In the last decade, probabilistic and data-driven models have become quite standard throughout natural language processing.
For English, many POS taggers are available, employing machine learning techniques based on Hidden Markov Models [15], transformation-based error-driven learning [10], decision trees [9] and maximum entropy methods [6]. Some taggers are hybrids, using both stochastic and rule-based approaches. Most of these POS taggers reach an accuracy between 92% and 97%; however, these accuracies are aided by the availability of large annotated corpora for English. Further, a few tagging systems are available for South Asian languages such as Hindi, Tamil and Bengali [8, 12, 13, 14]. In 2006, a POS tagger was proposed for Hindi which uses an annotated corpus of 15,562 words and a decision tree based learning algorithm; it reached an accuracy of 93.45% with a tag set of 23 POS tags [14]. For Bengali, a tagger was developed using a corpus-based semi-supervised learning algorithm based on HMMs [13].
3. PREVIOUS WORK ON SINHALA NLP ANALYSIS
Some important language analysis work has been done for Sinhala: a tag set [16] and a corpus of one million words [17] have been created, important initiatives that substantially encourage NLP research on the Sinhala language. But unfortunately, the progress of computational linguistic analysis of Sinhala is far behind that of other languages. To our knowledge, there is no well-known automated POS tagging system available for the Sinhala language.
4. MORPHOLOGY IN SINHALA LANGUAGE
Sinhala is a morphologically rich and agglutinative language, in which root words are inflected in different contexts. In Sinhala, words are defined as written streams of letters forming a sensible understanding to a person, denoting or relating to the physical world or to an abstract concept. As in English, the basic building blocks of Sinhala words are sound units, not letters, and they distinguish two broad classes of morphemes: lemmas and affixes. The lemma (stem) is the "main" morpheme of the word, supplying the main meaning, while the affixes add "additional" meaning of various kinds. Sinhala words are often postpositionally inflected with various grammatical features. A Sinhala verb inflects to specify almost everything, such as gender, singularity or plurality, and person markings, and represents the tense. Sinhala nouns inflect to specify singularity or plurality, gender, person marking and the case of the noun [18].
According to tradition, there are four main types of words in the Sinhala language [4, 5]:
1. Noun - kdu mo.
2. Verb - ls%hd mo.
3. Upasarga – Wmi¾. mo (no direct matching with English grammar)
4. Nipatha – ksmd; mo (no direct matching with English grammar)
5. SINHALA WORD CLASSES
Traditionally the definition of POS has been based on morphological and syntactic functions [7]. As in most other languages, POS in Sinhala can be divided into two broad categories: closed class types and open class types. Closed classes are those that have relatively fixed membership. Closed class words are generally function words, which tend to be very short, occur frequently, and play an important role in grammar. By contrast, open classes are those to which large numbers of words belong in any language, and into which new words are continually coined or borrowed from other languages. The words that usually carry the main content of a sentence belong to the open word classes.
In Sinhala, all nouns and verbs can be categorized under the open word classes. But Nipatha and Upasarga behave differently in Sinhala grammar. Words belonging to Nipatha and Upasarga do not change according to tense or gender. Upasarga always join with nouns and provide additional meaning to the noun; therefore, Upasarga are not categorized under any of the word classes. Nipatha, however, can be categorized as closed class words based on their fixed membership.
In addition, Sinhala pronouns can be classified as open class words based on their morphological properties, but they can also be classified as closed class words based on their fixed membership in the language. Sinhala pronouns are forms of nouns commonly referring to persons, places or things [11].
6. POS TAG SET FOR SINHALA LANGUAGE
Table 1 presents the tag set defined for the Sinhala language, which was developed by the UCSC under the PAN Localization project in 2005 [16]. This tag set contains 26 tags, mostly based on morphological and syntactical features of Sinhala. Currently this is the only tag set available for the Sinhala language, and we use it in our research.
However, there are a few issues that the authors encountered during the process of defining the tag set, owing to the syntactical complexity of Sinhala [16]:
1. Separation of Particles (words that resemble prepositions) and Post-positions (which, by definition, follow a noun or a noun phrase).
2. Separation of Compound Nouns - combinations of multiple nouns that act as a single noun.
3. Multiword expressions - certain word combinations/phrases that can function as one grammatical category.
Table 1. Sinhala Tag Set

No. Tag Description
1 NNR Common Noun Root
2 NNM Common Noun Masculine
3 NNF Common Noun Feminine
4 NNN Common Noun Neuter
5 NNPA Proper Noun Animate
6 NNPI Proper Noun Inanimate
7 PRPM Pronoun Masculine
8 PRPF Pronoun Feminine
9 PRPN Pronoun Neuter
10 PRPC Pronoun Common
11 QFNUM Number Quantifier
12 DET Determiner
13 JJ Adjective
14 RB Adverb
15 RP Particle
16 VFM Verb Finite Main
17 VNF Verb Non Finite
18 VP Verb Participle
19 VNN Verbal Non Finite Noun
20 POST Postpositions
21 CC Conjunctions
22 NVB Noun in Kriya Mula
23 JVB Adjective in Kriya Mula
24 UH Interjection
25 FRW Foreign Word
26 SYM Not Classified
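For use in an implementation, the tag set above can be encoded as a simple lookup table. The sketch below is our own illustration (the names `SINHALA_TAGS` and `is_valid_tag` are not from the paper); it stores the 26 tags of Table 1 with their descriptions and checks tag labels against the set:

```python
# The 26-tag Sinhala tag set from Table 1, encoded as a lookup table.
SINHALA_TAGS = {
    "NNR": "Common Noun Root", "NNM": "Common Noun Masculine",
    "NNF": "Common Noun Feminine", "NNN": "Common Noun Neuter",
    "NNPA": "Proper Noun Animate", "NNPI": "Proper Noun Inanimate",
    "PRPM": "Pronoun Masculine", "PRPF": "Pronoun Feminine",
    "PRPN": "Pronoun Neuter", "PRPC": "Pronoun Common",
    "QFNUM": "Number Quantifier", "DET": "Determiner",
    "JJ": "Adjective", "RB": "Adverb", "RP": "Particle",
    "VFM": "Verb Finite Main", "VNF": "Verb Non Finite",
    "VP": "Verb Participle", "VNN": "Verbal Non Finite Noun",
    "POST": "Postpositions", "CC": "Conjunctions",
    "NVB": "Noun in Kriya Mula", "JVB": "Adjective in Kriya Mula",
    "UH": "Interjection", "FRW": "Foreign Word", "SYM": "Not Classified",
}

def is_valid_tag(tag):
    """Check whether a label belongs to the 26-tag set."""
    return tag in SINHALA_TAGS
```

Such a table is useful for validating annotated corpus files before training.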
7. SINHALA TEXT CORPUS
Corpus is also an important lexical resource in the field of NLP. In this research we use the Beta
version of the Corpus developed by the UCSC under PAN Localization project in 2005 [17],
which contains around 650 000 words and out of which 70000 distinct words, that comprise of
data drawn from different kinds of Sinhala newspaper articles.
8. POS TAGGING
Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to
each word in a sentence [7]. The input to a tagging algorithm is a string of words and a tag set.
The output is a single best tag for each word. For example, here is a sample sentence from Sinhala
Text Corpus of a news report about “Silsamadana on a Wesak poya day” which each word tagged
with mapping tag using the tag set defined in Table I.
Example: වෙසක්_NNPI ව ෝය_NNN නිමිත්වෙන්_POST මැයි_NNPI 2_NUM ෙැනි_QFNUM
දා_NNN ැෙැති_VP ශීල_JJ ව්යා ාරයට_NNN ද_RP වදසීයක්_QFNUM මණ_RP
පිරිසක්_NNM සහභාගි_NVB වූහ_VFM ._. [Refer the Sinhala glossary for meaning of Sinhala
words]
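A tagged sentence in this corpus format can be split back into (word, tag) pairs programmatically. The following minimal sketch assumes the word_TAG convention shown above (the function name is our own):

```python
def parse_tagged_line(line):
    """Split a line of word_TAG tokens into (word, tag) pairs.
    The tag follows the last underscore in each token, so words
    that themselves contain an underscore are still handled."""
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

# A toy line in the same format (Latin placeholders, not real corpus text):
print(parse_tagged_line("ibba_NNPA dennek_QFNUM innawa_VFM"))
# [('ibba', 'NNPA'), ('dennek', 'QFNUM'), ('innawa', 'VFM')]
```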
Sinhala is a morphologically rich and agglutinative language, whose words are made up of lexical roots combined with affixes or prefixes, so automatically assigning a tag to each word in a language like Sinhala is very complex. The main challenge in Sinhala POS tagging is dealing with this complexity of words. Ambiguity also adds some complexity to the tagging process, though fortunately most words in Sinhala are unambiguous. Ambiguity is the existence of more than one possible POS usage in different contexts. For example, the noun ibba has two meanings, tortoise and padlock, bearing animateness and inanimateness according to the context. The problem of POS tagging is to resolve these ambiguities by choosing the proper tag for the context, not for the word in isolation; this can be done by looking at the words associated with the word in question.
6. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
14
Example:
bífnd_NNPA fokAfklA bkAkjd (There are two tortoises).
bífnd_NNPI fokAfklA od,d ;shkjd (Two padlocks have been fixed).
Most tagging algorithms fall into one of two classes: rule-based taggers and stochastic taggers [7]. Rule-based taggers generally involve a large database of hand-written disambiguation rules. Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context. In addition, there are taggers that use a hybrid approach, employing both of the above methods to resolve tagging ambiguity; these are called transformation-based taggers or Brill taggers [7]. In this research we investigate the applicability of the stochastic tagging approach to the Sinhala language.
9. OUR APPROACH
We next describe the approach and the overall application architecture defined for the Sinhala POS tagger in this research. To find a suitable tagging approach for Sinhala, we analysed multiple approaches that have already been discussed for other morphologically rich languages and decided to use a well-known stochastic approach, the Hidden Markov Model, which has proven evidence of good results for other languages. Probability is the basic principle behind the HMM; the model described here follows the concepts given in reference [7].
The intuition behind all stochastic taggers is a simple generalization of "pick the most likely tag for this word". For a given sentence or word sequence, an HMM tagger chooses the tag sequence that maximizes

P(word | tag) * P(tag | previous n tags).
An HMM tagger generally chooses a tag sequence for a given sentence rather than a tag for a single word. This approach assumes that we are trying to compute the most probable sequence of tags T = (t1, t2, …, tn) for a given sequence of words in the sentence W = (w1, w2, …, wn):

T* = argmax_T P(T|W),

where argmax_T denotes the value of T for which P(T|W) attains its maximum. By Bayes' law, P(T|W) can be expressed as

P(T|W) = P(T) * P(W|T) / P(W).

So we choose the sequence of tags that gives

T* = argmax_T P(T) * P(W|T) / P(W).
Here argmax selects the tag sequence T for which P(T) * P(W|T) / P(W) attains its maximum value. Since we are looking for the most likely tag sequence for a particular word sequence, the probability of the word sequence P(W) is the same for every tag sequence and can be ignored. So we get

T* = argmax_T P(T) * P(W|T),   (1)

where P(T) is the prior probability and P(W|T) is the likelihood probability.
From the chain rule of probability, we get

P(T) * P(W|T) = Π(i=1..n) P(wi | w1 t1 … wi-1 ti-1 ti) * P(ti | w1 t1 … wi-1 ti-1).

But for a long sequence of words, calculating probabilities like P(wi | w1 t1 … wi-1 ti-1 ti) * P(ti | w1 t1 … wi-1 ti-1) is not easy: there is no simple way to estimate the probability of a tag given a long sequence of preceding words. We can solve this problem with a useful simplification: we approximate the probability of a word given all the previous words. The model in which the probability of a word depends only on the single previous word is called a bigram model; it approximates the probability of a word given all the previous words by the conditional probability given the preceding word.
This assumption that the probability of a word depends only on the previous word is called the Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram model to the trigram model, which looks two words into the past [7].
In practice, the trigram model is widely used in NLP applications, so let us define the simplifying
assumptions for this scenario.
First, we make the assumption that the probability of a word depends only on its tag, i.e.,

P(W|T) ≈ ∏_{i=1..n} P(wi | ti). (1)

Next, we make the assumption that the tag history can be approximated by the most recent two
tags:

P(T) ≈ ∏_{i=1..n} P(ti | ti-2, ti-1). (2)

From (1) and (2), the best tag sequence can be chosen so that it maximizes

T̂ = argmax_T ∏_{i=1..n} P(wi | ti) P(ti | ti-2, ti-1). (3)
Now, as usual, we can use maximum likelihood estimation from relative frequencies to compute
these probabilities. We use the corpus to find counts of the tag sequences ti-2, ti-1, ti and
ti-2, ti-1, where ti is the i-th tag and ti-1, ti-2 are the previous two tags, together with the count
of wi with ti, where wi is the i-th word and ti is the tag assigned to it. We compute the
probabilities

P(ti | ti-2, ti-1) = c(ti-2 ti-1 ti) / c(ti-2 ti-1)

and

P(wi | ti) = c(ti, wi) / c(ti)

for all wi, where 1 ≤ i ≤ n.
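These maximum likelihood estimates can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the (word, tag) input format and the `<s>` start pseudo-tags are assumptions made for the sketch.

```python
from collections import defaultdict

def estimate_probabilities(tagged_sentences):
    """MLE from a tagged corpus (each sentence is a list of (word, tag) pairs):
    transition  P(ti | ti-2, ti-1) = c(ti-2 ti-1 ti) / c(ti-2 ti-1)
    emission    P(wi | ti)         = c(ti, wi) / c(ti)."""
    tri = defaultdict(int)        # c(ti-2 ti-1 ti)
    bi = defaultdict(int)         # c(ti-2 ti-1)
    emit = defaultdict(int)       # c(ti, wi)
    tag_count = defaultdict(int)  # c(ti)
    for sentence in tagged_sentences:
        # Two pseudo start tags so the first real tags have a full history.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bi[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sentence:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
    trans_p = {k: n / bi[(k[0], k[1])] for k, n in tri.items()}
    emit_p = {k: n / tag_count[k[0]] for k, n in emit.items()}
    return trans_p, emit_p
```

With a one-sentence corpus every count equals one, so all estimated probabilities come out as 1.0, which is a quick sanity check of the formulas.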
9.1 The Algorithm
The algorithm explained below is based on the Viterbi algorithm [7], which is widely used in NLP
applications; it considers all the words in the given sentence simultaneously and computes the
most likely tag sequence. More formally, the algorithm searches for the best tag sequence for a
given observation sequence W = (w1, w2, …, wn) based on the text corpus. Each cell viterbi[t, i]
of the matrix (a two-dimensional array) contains the probability of the path that covers the first
t observations and ends in state i; this is the most probable of all partial tag sequences ending
in that cell.
The algorithm sets up a probability matrix, with one column for each observation index (t) and
one row for each state (i) in the state graph. The algorithm first creates t+2 columns: the first
column is the initial observation, which is the start of the sequence; the next corresponds to the
first observation, and so on. It then initializes the first column by setting the probability of the
start state to 1 and all other probabilities to 0. For each column of the matrix, that is, for each
time index t, each cell viterbi[t, j] will contain the probability of the most likely path ending in
that cell j. We calculate this probability recursively, by maximizing over the probabilities of
coming from all possible preceding states. Then we move to the next column: from each state i,
viterbi[0, i] in column 0, we compute the probability of moving into each cell j, viterbi[1, j], in
column 1, and so on; finally, the probability of the best path appears in the final column. Back-tracing
can then be done to find the path that gives the best possible tag sequence.
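As a concrete illustration of the matrix-filling and back-tracing steps, here is a minimal bigram Viterbi sketch in Python (the trigram case adds one more dimension of tag history). The `<s>` start symbol and the dictionary-based probability tables are assumptions made for this sketch, not the paper's implementation.

```python
def viterbi(words, tags, trans_p, emit_p):
    """Bigram Viterbi: V[t][tag] holds the probability of the best tag path
    for the first t+1 words that ends in `tag`; `back` records the choices."""
    n = len(words)
    V = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    # Column 0: transitions out of the start state "<s>".
    for tag in tags:
        V[0][tag] = trans_p.get(("<s>", tag), 0.0) * emit_p.get((tag, words[0]), 0.0)
        back[0][tag] = None
    # Fill each later column by maximizing over all preceding states.
    for t in range(1, n):
        for tag in tags:
            best_prev, best_p = None, 0.0
            for prev in tags:
                p = (V[t - 1][prev]
                     * trans_p.get((prev, tag), 0.0)
                     * emit_p.get((tag, words[t]), 0.0))
                if p > best_p:
                    best_prev, best_p = prev, p
            V[t][tag] = best_p
            back[t][tag] = best_prev
    # The best path ends in the most probable cell of the final column;
    # back-trace from there to recover the full tag sequence.
    last = max(tags, key=lambda tag: V[n - 1][tag])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

On a toy model where "the" can only be DET and "dog" only NNN, the decoder returns ["DET", "NNN"], matching the intuition behind the matrix construction above.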
9.2 Overall Application Architecture and the Design
Figure 1 shows the overall architecture of the proposed tagger, which is a two-step process: it
first runs through the tagged corpus and extracts the linguistic knowledge; it then runs through
the raw text input and generates the best tag sequence for the sequence of input words, based on
the knowledge gathered from the corpus.
Lexical Parser: Checks the boundary conditions of each sentence and word as defined in the
lexical rules, and prepares the input for tokenization and pre-processing.
Tokenization: Runs through the tagged corpus, separates out the words and tags, and prepares
them for probability calculation.
Probability Calculation: Calculates the transition probability and the observation likelihood
probability for each word and tag-sequence pair in the corpus, as explained in section 9.3
below.
Viterbi Matrix Analyzer: Prepares a state graph containing all possible state transitions for the
given text input, and calculates and assigns a state transition probability for each transition in
the matrix, as explained in section 9.1 above.
Tag Sequence Analyzer: Back-traces the Viterbi matrix, analyses the maximum-probability path
and assigns a tag to each word in the sentence based on the highest probability.
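The Tokenization step above can be sketched as follows, assuming the `word_TAG` annotation format shown in the corpus example in section 10; the function and variable names are illustrative, not taken from the implementation.

```python
def tokenize_tagged_corpus(lines):
    """Split each 'word_TAG' token of a tagged corpus line into a
    (word, tag) pair, one list of pairs per sentence."""
    sentences = []
    for line in lines:
        pairs = []
        for token in line.split():
            # Split on the last underscore so the punctuation token '._.'
            # still yields ('.', '.').
            word, sep, tag = token.rpartition("_")
            if sep:  # skip malformed tokens with no underscore
                pairs.append((word, tag))
        sentences.append(pairs)
    return sentences
```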
9.3 Training the Tagger
The next important step is training the tagger. The training method we describe here is based on
a supervised learning approach. It runs over the corpus, makes use of the tagged data, and
estimates the transition probabilities P(tag | previous tag) and the observation likelihoods
P(word | tag) for the HMM.
The transition probability P(ti | ti-1) is calculated simply by the following formula:

P(ti | ti-1) = c(ti-1 ti) / c(ti-1),

where c(ti-1 ti) is the count of the tag sequence ti-1, ti in the corpus.
For calculating the observation likelihood probability P(wi | ti), we calculate the unigram count
(a unigram model uses only one piece of information: the item under consideration) of a word
together with the tag assigned to it in the tagged data. The likelihood probability is calculated
simply by the following formula:

P(wi | ti) = c(ti, wi) / c(ti),

where c(ti, wi) is the number of times word wi is assigned tag ti in the corpus.
Figure 1. The Architecture of the Tagger
10. EVALUATION
The evaluation of the system was driven mainly by training the system on a Sinhala text corpus
comprising 2754 sentences and 90551 words. The training data set was selected from the Sinhala
text corpus developed by UCSC, and we used only articles drawn from various Sinhala
newspapers. The example below shows a part of the corpus, in which each word is annotated
with its corresponding tag.
Example: ඇරිස්ටයිඩ්_NNPA සිය_PRP පාලනයේ_NNN අවසන්_JVB කාලයේ_NNN දී_VNF
කියා_VNF සිටි_VP පරිදි_POST ,_, ඔහුයේ_PRP ජනාධිපතිත්වයේ_NNN ඉතිරි_NVB කාල_NNN
සීමාව_NNN තුළ_POST විරුද්ධවාදීන්_NNM සමඟ_CC බලය_NNN යබදා_VNF ගැනීයේ_VNN
පදනමක්_NNN මත_POST ,_, අර්බුදය_NNN යේ_DET තරමට_NNN යමෝරා_VNF ඒයමන්_VNN
වළකා_JVB ගත_VP හැකිව_? තිබිණි_VFM ._.
The testing was performed on test data extracted from the corpus, and accuracy was calculated
from the number of correct tags proposed by the system and the total number of words in the
sentence(s), by the following formula:

Accuracy = (number of correct tags / total number of words) × 100%.

The results were obtained by performing cross-validation over the corpus. The accuracy for
known and unknown words was also measured separately.
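The accuracy formula above amounts to comparing a predicted tag sequence against a gold sequence; a minimal sketch (names are illustrative):

```python
def tagging_accuracy(predicted_tags, actual_tags):
    """Accuracy = (number of correct tags / total number of words) * 100."""
    correct = sum(1 for p, a in zip(predicted_tags, actual_tags) if p == a)
    return 100.0 * correct / len(actual_tags)
```

For example, 20 correct tags out of 22 words gives approximately 90.91%, the known-word figure reported in the next section.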
11. RESULT AND DISCUSSION
Testing was done under two classifications: first, we tested only with known words (words
already tagged in the data the tagger was trained on), which gives a very high accuracy, close to
95%; secondly, we tested a data set containing a few unknown words, which gives a lower
accuracy. The tagger does not perform once it reaches an unknown word.
Table 2 contains part of the test results that were obtained when evaluating known-word
scenarios. The actual and predicted tag assignments for each word in the sentences are shown in
the table.
Table 3 below presents the confusion matrix, which summarizes the test results given in Table 2.
In this confusion matrix, all correct predictions lie on the diagonal of the table. Of the 9 actual
NNN tag assignments, the system predicted NNN for 7 words and NVB for the other two; these
were the only deviations from the actual tags. In this case, the accuracy of the system reached
90.91% for the known-words scenario.
Hence, increasing the size of the training corpus is required to increase the tagging accuracy. It
is also necessary to include data from a wide range of domains, which would make the corpus
more unbiased and representative, and further research is required on increasing and optimizing
the tagging accuracy for known-word scenarios.
Further, handling data with unknown words is also an essential need for the tagger. When the
system reaches an unknown word, the current tagger fails to propose a tag, since the system has
not been trained on that word and the tagging algorithm does not have enough intelligence to
propose tags for untrained words. Improvements can therefore be suggested for the algorithm by
extracting knowledge mainly from the open-class word categories, since new words that are
coined or borrowed from other languages most commonly belong to the open word classes. Due
to the fixed membership of the closed-class word categories, we can assume that the words
belonging to the closed classes are well defined in Sinhala grammar and fixed. So improvements
to the algorithm can be suggested that focus more on words belonging to the sub-categories of
open-class words, such as nouns, verbs and pronouns. This could be done by adding some
intelligence to the tagger through a set of hand-written disambiguation rules, following a hybrid
approach in the tagging algorithm.
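One way such open-class knowledge could be used is a simple suffix-based fallback for unknown words. This is only an illustrative sketch of the suggested direction, not something implemented in the tagger described above, and the suffix-to-tag mapping shown is hypothetical; a real one would have to be derived from the corpus.

```python
def guess_open_class_tag(word, suffix_to_tag, default_tag="NNN"):
    """Fallback for unknown words: pick an open-class tag from the word's
    ending; default to a common noun tag when no suffix matches.
    Longer suffixes are tried first so the most specific rule wins."""
    for suffix in sorted(suffix_to_tag, key=len, reverse=True):
        if word.endswith(suffix):
            return suffix_to_tag[suffix]
    return default_tag
```

A hybrid tagger could invoke such a rule only when the HMM has no emission probability for the word, leaving known-word behaviour unchanged.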
Table 2. Test Data

1
Predicted: ලාාංකිකයින්ට _NNM ආයාචනා _NNN !_.
Actual: ලාාංකිකයින්ට _NNM ආයාචනා _NNN !_.

2
Predicted: බ්රිතාන්ය _NNPI ජාතිකයන් _NNM පස් _QFNUM යදනා _NNM නිදහස් _NVB !_.
Actual: බ්රිතාන්ය _NNPI ජාතිකයන් _NNM පස් _QFNUM යදනා _NNM නිදහස් _NNN !_.

3
Predicted: නිදහස් _NNN සන්ධානයයන් _NNN ශ්රී _NNPI ලාාංකිකයින්ට _NNM ආයාචනා _NVB !_.
Actual: නිදහස් _NNN සන්ධානයයන් _NNN ශ්රී _NNPI ලාාංකිකයින්ට _NNM ආයාචනා _NNN !_.

4
Predicted: බ්රිතාන්ය _NNPI මහ _JJ යකොමසාරිස් _NNM උතුරට _NNN යයි _VFM ._.
Actual: බ්රිතාන්ය _NNPI මහ _JJ යකොමසාරිස් _NNM උතුරට _NNN යයි _VFM ._.

5
Predicted: දිේ _JJ විජයයන් _NNN ධර්ම _NNN විජය _NNN කරන _VP ඇෆ්ගන් _NNPI තයේබාන්වරු _NNPA ._.
Actual: දිේ _JJ විජයයන් _NNN ධර්ම _NNN විජය _NNN කරන _VP ඇෆ්ගන් _NNPI තයේබාන්වරු _NNPA ._.
Further, our research opens more areas for continued research on tagging the Sinhala language,
leading to further work on optimization techniques and unknown-word handling approaches.
Table 3. Confusion Matrix of the Test Result

                  Predicted
Actual    NNM  NNN  NNPI  NVB  JJ  VFM  VP
NNM         5
NNN              7          2
NNPI                  4
NVB                         0
JJ                              2
VFM                                  1
VP                                        1
12. CONCLUSION AND FUTURE WORK
In this research, our effort was mainly focused on giving a push to NLP and computational
linguistics analysis for the Sinhala language by developing a tagging system (to our knowledge,
there is no language-specific tagging system available for Sinhala). In this paper, we have
described the POS tagging approach we developed, which is an implementation of a stochastic
model based on an HMM, and we have produced an algorithm for this model. The model was
tested against a Sinhala text corpus of 90551 words and 2754 sentences; the tagger gave more
than 90% accuracy for known words, but the system does not yet perform well on text containing
unknown words, so unknown-word scenarios remain an open area for further research.
Though this research produced a tagger for the Sinhala language, more research is required to
improve and optimize the algorithm. Hence, several interesting directions are suggested here for
future work.
- Since new words are continuously entering the language, handling unknown
(Out-Of-Vocabulary) words is required.
- In addition to disambiguation, a few other complex scenarios exist in the Sinhala language,
such as separating particles and post-positions, separating compound nouns, handling
multiword expressions (combinations/phrases that can function as one grammatical category)
and separating the uses of Nipatha in different contexts; these are not handled in this research.
- Smoothing techniques can be applied to get a better outcome.
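The smoothing suggestion in the last point could, for example, take the form of add-one (Laplace) smoothing of the transition counts, sketched here under the bigram formulation of section 9.3; this is one standard option among several, not a technique evaluated in this paper.

```python
def smoothed_transition(count_bigram, count_prev_tag, tagset_size):
    """Add-one (Laplace) smoothed transition probability:
    P(ti | ti-1) = (c(ti-1 ti) + 1) / (c(ti-1) + |tagset|),
    so an unseen tag pair receives a small non-zero probability
    instead of zeroing out an entire path in the Viterbi matrix."""
    return (count_bigram + 1) / (count_prev_tag + tagset_size)
```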
ACKNOWLEDGEMENTS
I express my immense gratitude and many thanks to Mr. Harsha Kumara of the University of
Kelaniya for his invaluable support in providing the initiative for NLP in the Sinhala language.
Many thanks also to Mrs. Kumudu Gamage of the Department of Linguistics, University of
Kelaniya, for her kind support.
Glossary of Sinhala Terms
Sinhala Term English Translation
1 වෙසක් ව ෝය Vesak Poya
2 නිමිත්වෙන් due to
3 මැයි ෙැනි 2 දා on May 2nd
4 ැෙැති held
5 ශීල ෙයා ාරයට ද program of observing Sill
6 වදසීයක් මණ around two hundred
7 පිරිසක් persons
8 සහභාගි වූහ participated
9 bífnd tortoises, padlock
10 fokAfklA two
11 bkAkjd there are
12 od,d ;shkjd have been fixed
13 ඇරිස්ටයිඩ් name, Aristed
14 සිය පාලනයේ in his rule
15 අවසන් කාලයේ දී during the last period of
16 කියා සිටි පරිදි as told
17 ඔහුයේ his
18 ජනාධිපතිත්වයේ presidency
19 ඉතිරි කාල සීමාව තුළ during remaining time period
20 විරුද්ධවාදීන් opposition
21 සමඟ with
22 බලය power
23 යබදා ගැනීයේ distribution of
24 පදනමක් base
25 මත on
26 අර්බුදය trouble
27 යේ තරමට යමෝරා ඒයමන් expanded to this level
28 වළකා ගත හැකිව තිබිණි could be avoided
29 ලාාංකිකයින්ට for Sri Lankans
30 ආයාචනා summoned, called
31 බ්රිතාන්ය British
32 ජාතිකයන් nationalist
33 පස් යදනා 5 (5 people)
34 නිදහස් freedom, is released
35 සන්ධානයයන් united party
36 මහ යකොමසාරිස් high commissioner
37 උතුරට to the north
38 යයි went
39 දිේ විජයයන් ධර්ම විජය කරන ruled by religion
40 ඇෆ්ගන් තයේබාන්වරු Afghan Talabanish
41 isxy, jHdlrKh Sinhala grammar
42 NdIdfõ of the language
43 ksjerÈ jHdlrKh úê correct grammar in use
REFERENCES
[1] List of languages by number of native speakers, from www.wikipedia.org.
[2] Endangered Language, from www.wikipedia.org.
[3] Languages of the world, from www.bbc.co.uk/languages/guide/languages.shtml.
[4] Alecx Perera, Sinhala Grammar (isxy, jHdlrKh), Wasana Publications, Dankotuwa, Sri Lanka,
2004.
[5] A.A.S. Adikari, Sinhala Grammar (isxy, jHdlrKh), Udaya Publications, Niwandama, Ja-ela, Sri
Lanka, 2008.
[6] A. Ratnaparkhi, “A maximum entropy Part of Speech tagger”. Proceedings of EMNLP, 1996.
[7] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics, and Speech Recognition, Pearson
Education Inc. (Singapore) Pte. Ltd., 5th edition, 2005.
[8] V. Dhanalakshmi, M. Anand kumar, K.P. Soman and S. Rajendran. “POS tagger and chunker for
Tamil language”. Proceedings of the 8th Tamil Internet Conference, Cologne, Germany, 2009.
[9] E. Black, "Decision tree models applied to the labeling of text with Parts of Speech", in DARPA
Workshop on Speech and Natural Language Processing, 1992.
[10] E. Brill, "Transformation-based error-driven learning and natural language processing: a case
study in Part of Speech tagging", Computational Linguistics, 1995.
[11] K.D.C Gunasakara, Sinhala Grammar (isxy, NdIdfõ ksjerÈ jHdlrKh úê), Tharanji Prints,
Navinna, Maharagama, Sri Lanka, 2008.
[12] Akshar Bharathi and Prashanth R. Mannen, "Introduction to the shallow parsing contest for
South Asian languages", Language Technology Research Centre, International Institute of
Information Technology, Hyderabad, India.
[13] Sandipan Dandapat, Sudeshna Sarkar, Anupam Basu, “A Hybrid model for Part of Speech tagging
and its application to Bengali”. Proceedings of international conference on computational
intelligence, 2004.
[14] Smriti Singh, Kuhoo Gupta, Manish Shrivastava and Pushpak, “Morphological richness offsets
resource demand experiences in constructing a POS tagger for Hindi”. Department of computer
science and engineering, Indian Institute of Technology, Bombay.
[15] T. Brants, "A statistical Part of Speech tagger", Proceedings of the 6th Applied NLP
Conference, 2000.
[16] UCSC Tagset, from http: //www.ucsc.cmb.ac.lk.
[17] UCSC/LTRL Sinhala Corpus, from http://www.ucsc.cmb.ac.lk, Beta Version April 2005.
[18] Dulip Herath, Kumudu Gamage and Anuradha Malalasekara, "Research report on Sinhala
lexicon", Language Technology Research Laboratory, UCSC.
Authors
A.J.P.M.P. Jayaweera graduated from the University of Colombo, Sri Lanka, completed a
Masters in Computer Science at the University of Kelaniya, Sri Lanka, and is currently
working towards a Master of Philosophy degree in the field of Natural Language Processing
and Computational Linguistics at the same university. Professionally, he is a software
engineer with 11+ years of experience in diverse technologies, with a proven career record in
enterprise application development, providing business-critical real-time applications for
leading industries. At present, he is working as a Software Architect at Virtusa Pvt Ltd, No 752,
Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka.
Dr. N. G. J. Dias graduated from the University of Colombo, Sri Lanka, specializing in
Mathematics as the main subject. He completed his Masters and Doctoral degrees in
Computer Science at the Queen's University of Belfast, Northern Ireland, and the
University of Wales, College of Cardiff, United Kingdom, respectively. At present,
Dr. Dias is a Professor in Computer Science attached to the Department of Statistics &
Computer Science of the University of Kelaniya, Sri Lanka. He has been working in the
field of Computer Science for the last 30 years and is the team leader of the Natural
Language Processing and Computational Mathematics research groups of the University.