Source- and target-word segmentation and alignment is a primary step in the statistical learning of a transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this approach has been found useful for handling out-of-vocabulary words in English-Chinese transliteration in the presence of multiple target dialects, we asked whether the same would hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but found instead that although our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed ours once the Top-10 results of each system were re-ranked using language-modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (the method applies equally well to most written Indic languages). For training, we applied a syllable-like segmentation and alignment to a large training set and built a statistical model on it by modifying a previous character-level, maximum-entropy-based transliteration learning system due to Kumaran and Kellner; for testing, we applied the same segmentation to a test English word, ran the model, and re-ranked the resulting Top-10 Telugu words. We also report on dataset creation and selection, since standard datasets are not available.
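To make the segmentation idea concrete, here is a minimal sketch of a syllable-like (consonant-vowel chunk) segmenter for romanized words. The CV-grouping heuristic and the function name syllable_like_segments are illustrative assumptions, not the paper's actual rule set.

import re

def syllable_like_segments(word: str):
    # Group each maximal consonant run with the vowel run that follows
    # it, e.g. "telugu" -> ["te", "lu", "gu"]. This CV-chunk heuristic
    # is an assumption, not the paper's exact segmentation.
    chunks = re.findall(r"[^aeiou]*[aeiou]+|[^aeiou]+$", word.lower())
    # Attach a final all-consonant chunk to the preceding chunk.
    if len(chunks) > 1 and not re.search(r"[aeiou]", chunks[-1]):
        chunks[-2:] = [chunks[-2] + chunks[-1]]
    return chunks

print(syllable_like_segments("hyderabad"))  # ['hyde', 'ra', 'bad']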
Intent Classifier with Facebook fastText
Facebook Developer Circle, Malang
22 February 2017
These are the slides for a Facebook Developer Circle meetup.
They are aimed at beginners.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES (kevig)
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing their performance in capturing word similarities against existing benchmark datasets of word-pair similarities. The paper conducts a correlation analysis between ground-truth word similarities and the similarities obtained by the different word embedding methods.
Word embedding, Vector space model, language modelling, Neural language model, Word2Vec, GloVe, fastText, ELMo, BERT, DistilBERT, RoBERTa, SBERT, Transformer, Attention
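As a concrete illustration of this kind of intrinsic evaluation, the sketch below computes the Spearman correlation between human word-pair similarity ratings and cosine similarities from an embedding lookup. The toy embeddings dict and the ratings are invented placeholders, not data from the paper; in practice one would load pre-trained vectors and a benchmark such as WordSim-353.

import numpy as np
from scipy.stats import spearmanr

# Toy stand-ins for pre-trained vectors and a human-rated benchmark.
embeddings = {
    "car":   np.array([0.9, 0.1, 0.0]),
    "auto":  np.array([0.8, 0.2, 0.1]),
    "fruit": np.array([0.1, 0.9, 0.3]),
    "apple": np.array([0.2, 0.8, 0.4]),
}
human_ratings = [("car", "auto", 8.9), ("fruit", "apple", 7.5), ("car", "fruit", 1.2)]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in human_ratings]
gold_scores = [s for _, _, s in human_ratings]
rho, _ = spearmanr(gold_scores, model_scores)
print(f"Spearman correlation: {rho:.3f}")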
GDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastText (rudolf eremyan)
This presentation compares different word embedding models and libraries, such as word2vec and fastText, describing their differences and showing their pros and cons.
A survey on phrase structure learning methods for text classification (ijnlc)
Text classification is the task of automatically classifying text into one of a set of predefined categories. The problem has been widely studied in different communities, including natural language processing, data mining, and information retrieval. Text classification is an important constituent of many information management tasks, such as topic identification, spam filtering, email routing, language identification, genre classification, and readability assessment. The performance of text classification improves notably when phrase patterns are used: phrase patterns help capture non-local behaviours and thus improve the text classification task. Phrase structure extraction is the first step toward phrase pattern identification. In this survey, a detailed study of phrase structure learning methods has been carried out. This will enable future work in several NLP tasks that use syntactic information from phrase structure, such as grammar checkers, question answering, information extraction, machine translation, and text classification. The paper also provides different levels of classification and a detailed comparison of the phrase structure learning methods.
Derric A. Alkis C
Abstract:
Giving the customer a high degree of confidence, and the seller more information about the products and about customers' desires, through the use of modern technology and machine learning: comments left on a product are evaluated, along with comments added later, and the product is thereby rated as good or bad.
This project was developed as part of the IRE course work at IIIT-Hyderabad.
Team members:
Aishwary Gupta (201302216)
B Prabhakar (201505618)
Sahil Swami (201302071)
Links:
https://github.com/prabhakar9885/Text-Summarization
http://prabhakar9885.github.io/Text-Summarization/
https://www.youtube.com/playlist?list=PLtBx4kn8YjxJUGsszlev52fC1Jn07HkUw
http://www.slideshare.net/prabhakar9885/text-summarization-60954970
https://www.dropbox.com/sh/uaxc2cpyy3pi97z/AADkuZ_24OHVi3PJmEAziLxha?dl=0
The growth of interpreting calls for a more objective and automatic quality measurement. We hold to the basic idea that 'translating means translating meaning', so interpretation quality can be assessed by comparing the meaning of the interpreting output with the source input. That is, a translation unit, a 'chunk' named a Frame (from frame semantics), and its components, named Frame Elements (FEs) (from FrameNet), are proposed, and their matching rate between target and source texts is explored. A case study in this paper verifies the usability of semi-automatic graded semantic-scoring measurement for human simultaneous interpreting and shows how to use Frame and FE matches to score. Experimental results show that the semantic-scoring metrics have a significant correlation coefficient with human judgment.
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES (ijnlc)
In this paper, we propose a compression-based multi-document summarization technique incorporating word bigram probability and a word co-occurrence measure. First, we implement a graph-based technique to achieve sentence compression and information fusion. In the second step, we use hand-crafted, rule-based syntactic constraints to prune the compressed sentences. Finally, we use a probabilistic measure, exploiting word co-occurrence within a sentence, to obtain the summaries. The system can generate summaries for any user-defined compression rate.
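As a rough illustration of the probabilistic scoring step, the sketch below ranks candidate compressed sentences by average log bigram probability. The tiny corpus and the add-one smoothing are placeholder assumptions, not the paper's exact measures.

from collections import Counter
from math import log

corpus = "the storm hit the coast . the coast was evacuated .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

def bigram_logprob(sentence):
    # Average add-one-smoothed log P(w_i | w_{i-1}) over the sentence.
    words = sentence.split()
    lp = sum(log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
             for a, b in zip(words, words[1:]))
    return lp / max(len(words) - 1, 1)

candidates = ["the storm hit the coast", "coast the hit storm the"]
print(max(candidates, key=bigram_logprob))  # fluent candidate wins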
Following are the questions I tried to answer in this presentation:
What is text summarization?
What is automatic text summarization?
How has it evolved over time?
What are the different methods?
How is deep learning used for text summarization?
What are the business applications?
The first few slides explain extractive summarization with its pros and cons; the next section explains abstractive summarization.
The last section highlights the business applications of each one.
Machine Translation (MT) refers to the use of computers for the task of translating
automatically from one language to another. The differences between languages and
especially the inherent ambiguity of language make MT a very difficult problem. Traditional
approaches to MT have relied on humans supplying linguistic knowledge in the form of rules
to transform text in one language to another. Given the vastness of language, this is a highly
knowledge intensive task. Statistical MT is a radically different approach that automatically
acquires knowledge from large amounts of training data. This knowledge, which is typically
in the form of probabilities of various language features, is used to guide the translation
process. This report provides an overview of MT techniques, and looks in detail at the basic
statistical model.
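The basic statistical model referred to here is usually stated as the noisy-channel decision rule; a standard formulation from the SMT literature (not quoted from this report) is:

\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\, P(f \mid e)

where f is the source sentence, e a candidate target sentence, P(e) the language model estimated from monolingual data, and P(f | e) the translation model estimated from parallel data.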
Cross lingual similarity discrimination with translation characteristics (ijaia)
In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment. This paper proposes a discriminative model, trained on a bilingual corpus, to divide a set of sentences in the target language into two classes according to their similarities to a given sentence in the source language. Positive outputs of the discriminative model are then ranked according to their similarity probabilities. The translation candidates for the given sentence are finally selected from the top-n positive results. One of the problems in model building is the extremely imbalanced training data, in which the positive samples are the translations of the target sentences, while the negative samples, or non-translations, are numerous or unknown. We train models on four kinds of sampling sets with the same translation characteristics and compare their performance. Experiments on an open dataset of 1,500 English-Chinese sentence pairs are evaluated by three metrics, with satisfying performance, much higher than the baseline system.
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving short and important text from a source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but much less effort has been devoted to Hindi. Here, we discuss various techniques covering features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
Automatic text summarization is the process of reducing the text content and retaining the
important points of the document. Generally, there are two approaches for automatic text summarization:
Extractive and Abstractive. The process of extractive based text summarization can be divided into two
phases: pre-processing and processing. In this paper, we discuss some of the extractive based text
summarization approaches used by researchers. We also provide the features for extractive based text
summarization process. We also present the available linguistic preprocessing tools with their features,
which are used for automatic text summarization. The tools and parameters useful for evaluating the
generated summary are also discussed in this paper. Moreover, we explain our proposed lexical chain
analysis approach, with sample generated lexical chains, for extractive based automatic text summarization.
We also provide the evaluation results of our system generated summary. The proposed lexical chain
analysis approach can be used to solve different text mining problems like topic classification, sentiment
analysis, and summarization.
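To illustrate the lexical chain idea in miniature, the sketch below greedily chains nouns whose WordNet senses are sufficiently related; the first-sense choice, the Wu-Palmer measure, and the threshold are illustrative assumptions, not the paper's actual chaining algorithm.

# Assumes the NLTK WordNet data is installed: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def build_chains(words, threshold=0.8):
    # Greedy lexical chaining: append a word to the first chain whose
    # head it is sufficiently related to; otherwise start a new chain.
    chains = []
    for w in words:
        syns = wn.synsets(w, pos=wn.NOUN)
        if not syns:
            continue  # skip words with no noun sense
        for chain in chains:
            head = wn.synsets(chain[0], pos=wn.NOUN)[0]
            sim = syns[0].wup_similarity(head)
            if sim is not None and sim >= threshold:
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

print(build_chains(["car", "vehicle", "banana", "truck", "fruit"]))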
French machine reading for question answering (Ali Kabbadj)
This paper proposes to unlock the main barrier to machine reading and comprehension of French natural-language texts. This opens the way for a machine to find a precise answer to a question buried in a mass of unstructured French text, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets. Until now, these techniques could not actually be used for French question answering (Q&A) applications, since no large Q&A training dataset existed. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, along with GloVe French word and character embedding vectors from a French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and obtained French Q&A models with an F1 score of around 70%.
Extractive summarization extracts objects from the entire collection without modifying the objects themselves; for text, the goal is to select whole sentences (without modifying them).
Phonetic Recognition In Words For Persian Text To Speech Systems (paperpublications3)
Abstract: Interest in text-to-speech synthesis has increased worldwide. Text-to-speech systems have been developed for many popular languages, such as English, Spanish, and French, and much research and development has been applied to those languages. Persian, on the other hand, has been given little attention compared to other languages of similar importance, and research in Persian is still in its infancy. The Persian language possesses many difficulties and exceptions that increase the complexity of text-to-speech systems: for example, short vowels are absent in written text, and homograph words exist. In this paper, we propose a new method for Persian text-to-phonetic conversion based on pronunciation by analogy (PbA) in words, semantic relations, and grammatical rules for finding the proper phonetics. Keywords: PbA, text to speech, Persian language, phonetic recognition.
Title: Phonetic Recognition In Words For Persian Text To Speech Systems
Authors: Ahmad Musavi Nasab, Ali Joharpour
International Journal of Recent Research in Mathematics Computer Science and Information Technology (IJRRMCSIT)
Paper Publications
An exploratory research on grammar checking of Bangla sentences using statist... (IJECEIAES)
N-gram-based language models are popular and extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem, and Kneser-Ney is one of the most prominent and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell-smoothing-based language modelling technique for checking the grammatical correctness of Bangla sentences, which showed promising results, outperforming previous methods. In this work, we propose an improved method using a Kneser-Ney-smoothed n-gram language model for grammar checking and perform a comparative performance analysis between the Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provide an improved technique for calculating the optimum threshold, which further enhances the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking the grammatical correctness of Bangla sentences.
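A minimal sketch of the general idea, assuming NLTK's KneserNeyInterpolated model: flag a sentence when any bigram scores below a tuned log-probability threshold. The toy romanized sentences and the threshold value are placeholders, not the paper's data or optimum-threshold technique.

from math import log
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy tokenized corpus standing in for Bangla training sentences.
sentences = [["ami", "bhat", "khai"], ["se", "boi", "pore"]]
train, vocab = padded_everygram_pipeline(2, sentences)
lm = KneserNeyInterpolated(2)
lm.fit(train, vocab)

def check_sentence(tokens, threshold=-4.0):
    # Flag the sentence if any bigram log-probability falls below the
    # threshold; the threshold here is an arbitrary example value.
    for a, b in zip(tokens, tokens[1:]):
        if log(max(lm.score(b, [a]), 1e-12)) < threshold:
            return "possibly ungrammatical"
    return "grammatical"

print(check_sentence(["ami", "bhat", "khai"]))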
Segmentation Words for Speech Synthesis in Persian Language Based On Silence (paperpublications3)
Abstract: In speech synthesis for text-to-speech systems, words are usually broken into different parts, and the recorded sound of each part is used to play the word. This paper uses silences in a word's pronunciation to obtain better speech quality. Most algorithms divide words into syllables, and some divide words into phonemes; this paper instead exploits silences in intonation, dividing words at silent regions and then assigning an equivalent sound to each part, whereupon joining the parts is reliable and the speech quality is smoother. The paper concerns the Persian language but is extendable to other languages. The method has been tested with a MOS test, and intelligibility, naturalness, and fluidity are better.
Keywords: TTS, SBS, syllable, diphone.
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le... (ijtsrd)
Natural Language Generation (NLG) is one of the major fields of Natural Language Processing (NLP). NLG can generate natural language from a machine representation. Generating suggestions for a sentence, especially in Indian languages, is difficult; one major reason is that they are morphologically rich and their word order is the reverse of English. Using a deep learning approach with Long Short-Term Memory (LSTM) layers, we can generate a possible set of corrections for the erroneous part of a sentence. To effectively generate a set of sentences having equivalent meaning to the original sentence using a deep learning (DL) approach, a model must be trained on this task, i.e., thousands of example inputs and outputs are needed. Veena S Nair | Amina Beevi A, "Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning", published in the International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
MT SUMMIT13. Language-independent Model for Machine Translation Evaluation wit... (Lifeng (Aaron) Han)
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013) pp. 215-222. Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA... (kevig)
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, as it improves the overall performance of NLP applications. In this paper, deep learning techniques are used to perform the NER task on Hindi text data, since, compared to English NER, Hindi NER has not been sufficiently addressed. This is a barrier for resource-scarce languages, as many resources are not readily available. Researchers have used various techniques, such as rule-based, machine learning based, and hybrid approaches, to solve this problem. Deep learning based algorithms are now being developed at large scale as an innovative approach to advanced NER models that yield the best results. In this paper, we devise a novel architecture based on a residual network over a Bidirectional Long Short-Term Memory (BiLSTM) with fastText word embedding layers. For this purpose, we use pre-trained word embeddings to represent the words in the corpus, where the NER tags of the words are defined in the annotated corpora used. Development of an NER system for Indian languages is a comparatively difficult task. We conducted various experiments to compare the results of NER with normal embedding versus fastText embedding layers, and to analyse the performance of the word embeddings with different batch sizes when training the deep learning models. We present state-of-the-art results with this approach in terms of F1 score.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES (ijnlc)
Stemming is the process of term conflation: it conflates all word variants to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications, such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and various existing stemmers for Indic languages.
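As a minimal sketch of the lightweight suffix-stripping approach many rule-based Indic stemmers use: strip the longest matching suffix while keeping a minimum stem length. The Hindi suffix list below is a tiny illustrative sample, not a complete rule set from any surveyed stemmer.

# Longest-match suffix stripping, the core of many rule-based Indic
# stemmers. These Hindi suffixes are illustrative examples only.
SUFFIXES = sorted(["ों", "ें", "ाएं", "ता", "ती", "ने", "ना", "ाओं", "ियों"],
                  key=len, reverse=True)

def stem(word: str, min_stem_len: int = 2) -> str:
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem_len:
            return word[: -len(suf)]
    return word

print(stem("लड़कियों"))  # strips "ियों"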
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI (ijnlc)
Machine transliteration has emerged as a very important research area in the field of machine translation. Transliteration basically aims to preserve the phonological structure of words, and proper transliteration of named entities plays a very significant role in improving the quality of machine translation. In this paper, we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed rules for syllabification, the process of extracting or separating syllables from words. We calculate probabilities for named entities (proper names and locations). For words which do not fall under the category of named entities, separate probabilities are calculated using relative frequency through a statistical machine translation toolkit known as MOSES. Using these probabilities, we transliterate the input text from English to Punjabi.
A high-quality end-to-end speech translation model relies on a large scale of speech-to-text training data, which is usually scarce or even unavailable for some low-resource language pairs. To overcome this, we propose a target-side data augmentation method for speech translation in low-resource languages. In particular, we first generate large-scale target-side paraphrases based on a paraphrase generation model that incorporates several statistical machine translation (SMT) features and the commonly used recurrent neural network (RNN). Then, a filtering model combining semantic similarity and speech-word-pair co-occurrence is proposed to select the highest-scoring source-paraphrase pairs from the candidates. Experimental results on English, Arabic, German, Latvian, Estonian, Slovenian, and Swedish paraphrase generation show that the proposed method achieves significant and consistent improvements over several strong baseline models on PPDB datasets (http://paraphrase.org/). To feed the paraphrase generation results into low-resource speech translation, two strategies are proposed: audio-text pair recombination and multi-reference training. Experimental results show that speech translation models trained on new audio-text datasets that incorporate the paraphrase generation results lead to substantial improvements over the baselines, especially for low-resource languages.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused humongous growth in the amount of data available to the common man. Summaries of documents can help users find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indexes for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) along with TF (Term Frequency) for extracting keywords, and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences requested by the user, a summary is generated.
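To illustrate how such scores can be combined, here is a small sketch computing TF, IDF, and the GSS coefficient from document counts and merging them into a single keyword score. The toy counts and the product combination are assumptions, not the paper's exact formulation.

import math

# Toy counts: N documents total, documents in category c, per-term
# document frequencies. All values are illustrative only.
N = 100
docs_in_c = 20
df = {"chess": 15, "the": 95}        # docs containing the term
df_in_c = {"chess": 12, "the": 19}   # docs in c containing the term
tf = {"chess": 7, "the": 40}         # occurrences in the target doc

def gss(t):
    # GSS(t, c) = P(t, c) P(!t, !c) - P(t, !c) P(!t, c)
    p_tc = df_in_c[t] / N
    p_tnc = (df[t] - df_in_c[t]) / N
    p_ntc = (docs_in_c - df_in_c[t]) / N
    p_ntnc = (N - docs_in_c - (df[t] - df_in_c[t])) / N
    return p_tc * p_ntnc - p_tnc * p_ntc

def keyword_score(t):
    idf = math.log(N / df[t])
    return tf[t] * idf * max(gss(t), 0)  # product combination is assumed

for t in tf:
    print(t, round(keyword_score(t), 4))  # "the" scores ~0, "chess" high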
Automatic classification of bengali sentences based on sense definitions pres... (ijctcm)
Based on the sense definitions of words available in the Bengali WordNet, an attempt is made to automatically classify Bengali sentences into different groups in accordance with their underlying senses. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while information about the different senses of a particular ambiguous lexical item is collected from the Bengali WordNet. On an experimental basis, we have used the Naive Bayes probabilistic model as a sentence classifier. We applied the algorithm over 1,747 sentences containing a particular Bengali lexical item which, because of its ambiguous nature, can trigger different senses that give the sentences different meanings. In our experiment, we achieved around 84% accuracy in sense classification over the total input sentences. We analyzed the residual sentences that failed in our experiment and affected the results, noting that in many cases wrong syntactic structures and scant semantic information are the main hurdles in the semantic classification of sentences. The applicational relevance of this study is attested in automatic text classification, machine learning, information extraction, and word sense disambiguation.
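A minimal sketch of this kind of Naive Bayes sense classifier, using scikit-learn; the English toy sentences and the two sense labels are invented placeholders, not items from the Bengali corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: sentences containing an ambiguous word, labelled
# with the sense it takes in each sentence.
sentences = [
    "he deposited money at the bank",
    "the bank approved the loan",
    "they picnicked on the river bank",
    "the bank of the stream was muddy",
]
senses = ["finance", "finance", "river", "river"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(sentences, senses)
print(clf.predict(["she opened an account at the bank"]))  # expected: ['finance']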
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES (mlaij)
The methodology presented in this paper deals with the problem of multilingual speech recognition. Current text and speech recognition and translation methods have very low accuracy when translating sentences that contain a mixture of two or more different languages. The paper proposes a novel approach to tackling this problem and highlights some of the drawbacks of current recognition and translation methods.
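Since the approach is built on MFCC features, here is a minimal sketch of extracting them with librosa; the file name is a placeholder, and the sample rate and 13-coefficient choice are common defaults rather than the paper's settings.

import librosa

# Load audio (path is a placeholder) and compute 13 MFCCs per frame,
# a standard front-end representation for speech recognition models.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)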
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
A Context-based Numeral Reading Technique for Text to Speech Systems (IJECEIAES)
This paper presents a novel technique for context-based numeral reading in Indian-language text-to-speech systems. The model uses a set of rules to determine the context of a numeral's pronunciation and is integrated with a waveform concatenation technique to produce speech from input text in Indian languages. For this purpose, three Indian languages are considered: Odia, Hindi, and Bengali. To analyze the performance of the proposed technique, a set of experiments is performed considering different contexts of numeral pronunciation, and the results are compared with an existing syllable-based technique. The results obtained from the different experiments show the effectiveness of the proposed technique in producing intelligible speech from the entered text utterances compared to the existing technique, even with much less storage and execution time.
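As an illustration of rule-based numeral context detection, the sketch below decides between reading a four-digit number as a year or as a quantity based on a neighbouring cue word. The cue words and the two contexts are illustrative assumptions, far simpler than the paper's rule set.

import re

YEAR_CUES = {"in", "since", "year", "by"}  # illustrative cue words only

def numeral_context(tokens, i):
    # Classify tokens[i] (a numeral) as 'year' or 'quantity'.
    if re.fullmatch(r"(1[5-9]|20)\d{2}", tokens[i]):
        prev = tokens[i - 1].lower() if i > 0 else ""
        if prev in YEAR_CUES:
            return "year"      # e.g. read 1996 as "nineteen ninety-six"
    return "quantity"          # e.g. read 1996 as "one thousand ..."

tokens = "the bridge was built in 1996".split()
print(numeral_context(tokens, 5))  # year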
Improvement in Quality of Speech associated with Braille codes - A Review (inscit2006)
J. Anurag, P. Nupur and Agrawal, S.S.
School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India
Centre for Development of Advanced Computing, Noida, India
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ... (kevig)
Neural machine translation is a new approach to machine translation that has shown effective results for high-resource languages. Recently, attention-based neural machine translation with a large-scale parallel corpus has played an important role in achieving high-performance translation. In this research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural machine translation models are introduced at the word-to-word, character-to-word, and syllable-to-word levels. We run experiments with the proposed models on translating long sentences and on addressing morphological problems. To mitigate the low-resource problem, source-side monolingual data are also used. This work thus investigates improving the Myanmar-to-English neural machine translation system. The experimental results show that the syllable-to-word-level neural machine translation model obtains an improvement over the baseline systems.
Source side pre-ordering using recurrent neural networks for English-Myanmar ... (IJECEIAES)
Word reordering has remained one of the challenging problems for machine translation when translating between language pairs with different word orders, e.g., English and Myanmar. Without reordering between these languages, a source sentence may be translated directly with a similar word order, and the translation cannot be meaningful. Myanmar is a subject-object-verb (SOV) language, and effective reordering is essential for translation. In this paper, we applied a pre-ordering approach using recurrent neural networks to pre-order the words of a source Myanmar sentence into the target English word order. This neural pre-ordering model is automatically derived from parallel word-aligned data, with syntactic and lexical features based on dependency parse trees of the source sentences. It can generate arbitrary permutations that may be non-local in the sentence and can be combined into English-Myanmar machine translation. We exploited the model to reorder English sentences into Myanmar-like word order as a preprocessing stage for machine translation, obtaining quality improvements comparable to the baseline rule-based pre-ordering approach on the Asian Language Treebank (ALT) corpus.
Building of Database for English-Azerbaijani Machine Translation Expert System (Waqas Tariq)
In this article, the results of the development of a machine translation expert system are presented. An approach to defining translation correspondences is suggested as the basis for creating the database and knowledge base of the system. The methods for compiling transformation rules, applied to the linguistic knowledge base of the expert system, are based on defining translation correspondences between the Azerbaijani and English languages.
Similar to ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION FROM ENGLISH TO AN INDIC LANGUAGE
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR (cscpconf)
The progressive development of Synthetic Aperture Radar (SAR) systems has diversified the exploitation of the images generated by these systems in different geoscience applications. The detection and monitoring of surface deformations produced by various phenomena have benefited from this evolution and have been realized using interferometry (InSAR) and differential interferometry (DInSAR) techniques. Nevertheless, spatial and temporal decorrelation of the interferometric couples strongly limits the precision of the results of analysis by these techniques. In this context, we propose a methodological approach to surface deformation detection and analysis from differential interferograms, to show the limits of this technique according to noise quality and level. The detectability model is generated from the deformation signatures by simulating a linear fault merged into image couples from the ERS1/ERS2 sensors acquired over a region of southern Algeria.
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFICATION (cscpconf)
A novel trajectory-guided, concatenative approach for synthesizing high-quality video from real image samples is proposed. The automated lip-reading seeks the real image sample sequence, preserved in a library under the video data, that is closest to the HMM-predicted trajectory. The object trajectory is obtained by projecting the face patterns into a KDA feature space. The approach identifies a speaker's face by synthesizing the identity surface of the subject's face from a small sample of patterns that sparsely cover the view sphere. A KDA algorithm is used to discriminate the lip-reading images; the fundamental lip feature vector is then reduced to a low dimension using the 2D-DCT. The dimensionality of the mouth-area set is further reduced using PCA to obtain the eigen-lips approach proposed by [33]. Subjective performance results of the cost function under the automatic lip-reading model illustrate the performance of the method.
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE... (cscpconf)
Universities offer a software engineering capstone course to simulate a real-world working environment in which students can work in a team for a fixed period to deliver a quality product. The objective of this paper is to report on our experience in moving from the Waterfall process to the Agile process in conducting the software engineering capstone project. We present capstone course designs for both Waterfall-driven and Agile-driven methodologies, highlighting the structure, deliverables, and assessment plans. To evaluate the improvement, we conducted a survey across two sections taught by two different instructors to evaluate students' experience in moving from the traditional Waterfall model to an Agile-like process. Twenty-eight students filled in the survey, which consisted of eight multiple-choice questions and an open-ended question to collect feedback. The results show that students were able to attain hands-on experience simulating a real-world working environment. They also show that the Agile approach helped students achieve an overall better design and avoid mistakes they had made in the initial design completed in the first phase of the capstone project. In addition, students were able to assess their team capabilities and training needs, and thus learn the required technologies earlier, which is reflected in the final product quality.
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES (cscpconf)
Using social media in education provides learners with an informal means of communication. Informal communication tends to remove barriers and hence promotes student engagement. This paper presents our experience using three different social media technologies in teaching a software project management course. We conducted surveys at the end of each semester to evaluate students' satisfaction and engagement. The results show that using social media enhances both; however, familiarity with the tool is an important factor in student satisfaction.
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC (cscpconf)
In real-world computing environments, using a computer to answer questions has been a human dream since the beginning of the digital era. Question-answering systems are referred to as intelligent systems that can provide responses to users' questions based on facts or rules stored in a knowledge base, generating answers to questions asked in natural language. One of the first main ideas of fuzzy logic was to work on the problem of computer understanding of natural language. This survey paper therefore provides an overview of what question answering is, its system architecture, and its possible relationship to and differences from fuzzy logic, as well as previous related research with respect to the approaches followed. At the end, the survey provides an analytical discussion of the proposed QA models, alone or combined with fuzzy logic, and their main contributions and limitations.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
Human beings generate different speech waveforms while speaking the same word at different times. Also, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words which facilitate preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming technique for global alignment and shortest distance measurements. The DPW algorithm can be used to enhance the pronunciation dictionaries of the well-known languages like English or to build pronunciation dictionaries to the less known sparse languages. The precision measurement experiments show 88.9% accuracy.
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS cscpconf
In education, the use of electronic (E) examination systems is not a novel idea, as Eexamination systems have been used to conduct objective assessments for the last few years. This research deals with randomly designed E-examinations and proposes an E-assessment system that can be used for subjective questions. This system assesses answers to subjective questions by finding a matching ratio for the keywords in instructor and student answers. The matching ratio is achieved based on semantic and document similarity. The assessment system is composed of four modules: preprocessing, keyword expansion, matching, and grading. A survey and case study were used in the research design to validate the proposed system. The examination assessment system will help instructors to save time, costs, and resources, while increasing efficiency and improving the productivity of exam setting and assessments.
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICcscpconf
African Buffalo Optimization (ABO) is one of the most recent swarms intelligence based metaheuristics. ABO algorithm is inspired by the buffalo’s behavior and lifestyle. Unfortunately, the standard ABO algorithm is proposed only for continuous optimization problems. In this paper, the authors propose two discrete binary ABO algorithms to deal with binary optimization problems. In the first version (called SBABO) they use the sigmoid function and probability model to generate binary solutions. In the second version (called LBABO) they use some logical operator to operate the binary solutions. Computational results on two knapsack problems (KP and MKP) instances show the effectiveness of the proposed algorithm and their ability to achieve good and promising solutions.
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINcscpconf
In recent years, many malware writers have relied on Dynamic Domain Name Services (DDNS) to maintain their Command and Control (C&C) network infrastructure to ensure a persistence presence on a compromised host. Amongst the various DDNS techniques, Domain Generation Algorithm (DGA) is often perceived as the most difficult to detect using traditional methods. This paper presents an approach for detecting DGA using frequency analysis of the character distribution and the weighted scores of the domain names. The approach’s feasibility is demonstrated using a range of legitimate domains and a number of malicious algorithmicallygenerated domain names. Findings from this study show that domain names made up of English characters “a-z” achieving a weighted score of < 45 are often associated with DGA. When a weighted score of < 45 is applied to the Alexa one million list of domain names, only 15% of the domain names were treated as non-human generated.
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...cscpconf
The amount of piracy in the streaming digital content in general and the music industry in specific is posing a real challenge to digital content owners. This paper presents a DRM solution to monetizing, tracking and controlling online streaming content cross platforms for IP enabled devices. The paper benefits from the current advances in Blockchain and cryptocurrencies. Specifically, the paper presents a Global Music Asset Assurance (GoMAA) digital currency and presents the iMediaStreams Blockchain to enable the secure dissemination and tracking of the streamed content. The proposed solution provides the data owner the ability to control the flow of information even after it has been released by creating a secure, selfinstalled, cross platform reader located on the digital content file header. The proposed system provides the content owners’ options to manage their digital information (audio, video, speech, etc.), including the tracking of the most consumed segments, once it is release. The system benefits from token distribution between the content owner (Music Bands), the content distributer (Online Radio Stations) and the content consumer(Fans) on the system blockchain.
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMcscpconf
This paper discusses the importance of verb suffix mapping in Discourse translation system. In
discourse translation, the crucial step is Anaphora resolution and generation. In Anaphora
resolution, cohesion links like pronouns are identified between portions of text. These binders
make the text cohesive by referring to nouns appearing in the previous sentences or nouns
appearing in sentences after them. In Machine Translation systems, to convert the source
language sentences into meaningful target language sentences the verb suffixes should be
changed as per the cohesion links identified. This step of translation process is emphasized in
the present paper. Specifically, the discussion is on how the verbs change according to the
subjects and anaphors. To explain the concept, English is used as the source language (SL) and
an Indian language Telugu is used as Target language (TL)
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...cscpconf
In this paper, based on the definition of conformable fractional derivative, the functional
variable method (FVM) is proposed to seek the exact traveling wave solutions of two higherdimensional
space-time fractional KdV-type equations in mathematical physics, namely the
(3+1)-dimensional space–time fractional Zakharov-Kuznetsov (ZK) equation and the (2+1)-
dimensional space–time fractional Generalized Zakharov-Kuznetsov-Benjamin-Bona-Mahony
(GZK-BBM) equation. Some new solutions are procured and depicted. These solutions, which
contain kink-shaped, singular kink, bell-shaped soliton, singular soliton and periodic wave
solutions, have many potential applications in mathematical physics and engineering. The
simplicity and reliability of the proposed method is verified.
AUTOMATED PENETRATION TESTING: AN OVERVIEWcscpconf
The using of information technology resources is rapidly increasing in organizations,
businesses, and even governments, that led to arise various attacks, and vulnerabilities in the
field. All resources make it a must to do frequently a penetration test (PT) for the environment
and see what can the attacker gain and what is the current environment's vulnerabilities. This
paper reviews some of the automated penetration testing techniques and presents its
enhancement over the traditional manual approaches. To the best of our knowledge, it is the
first research that takes into consideration the concept of penetration testing and the standards
in the area.This research tackles the comparison between the manual and automated
penetration testing, the main tools used in penetration testing. Additionally, compares between
some methodologies used to build an automated penetration testing platform.
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKcscpconf
Since the mid of 1990s, functional connectivity study using fMRI (fcMRI) has drawn increasing
attention of neuroscientists and computer scientists, since it opens a new window to explore
functional network of human brain with relatively high resolution. BOLD technique provides
almost accurate state of brain. Past researches prove that neuro diseases damage the brain
network interaction, protein- protein interaction and gene-gene interaction. A number of
neurological research paper also analyse the relationship among damaged part. By
computational method especially machine learning technique we can show such classifications.
In this paper we used OASIS fMRI dataset affected with Alzheimer’s disease and normal
patient’s dataset. After proper processing the fMRI data we use the processed data to form
classifier models using SVM (Support Vector Machine), KNN (K- nearest neighbour) & Naïve
Bayes. We also compare the accuracy of our proposed method with existing methods. In future,
we will other combinations of methods for better accuracy.
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...cscpconf
In order to treat and analyze real datasets, fuzzy association rules have been proposed. Several
algorithms have been introduced to extract these rules. However, these algorithms suffer from
the problems of utility, redundancy and large number of extracted fuzzy association rules. The
expert will then be confronted with this huge amount of fuzzy association rules. The task of
validation becomes fastidious. In order to solve these problems, we propose a new validation
method. Our method is based on three steps. (i) We extract a generic base of non redundant
fuzzy association rules by applying EFAR-PN algorithm based on fuzzy formal concept analysis.
(ii) we categorize extracted rules into groups and (iii) we evaluate the relevance of these rules
using structural equation model.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
In many applications of data mining, class imbalance is noticed when examples in one class are
overrepresented. Traditional classifiers result in poor accuracy of the minority class due to the
class imbalance. Further, the presence of within class imbalance where classes are composed of
multiple sub-concepts with different number of examples also affect the performance of
classifier. In this paper, we propose an oversampling technique that handles between class and
within class imbalance simultaneously and also takes into consideration the generalization
ability in data space. The proposed method is based on two steps- performing Model Based
Clustering with respect to classes to identify the sub-concepts; and then computing the
separating hyperplane based on equal posterior probability between the classes. The proposed
method is tested on 10 publicly available data sets and the result shows that the proposed
method is statistically superior to other existing oversampling methods.
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHcscpconf
Data collection is an essential, but manpower intensive procedure in ecological research. An
algorithm was developed by the author which incorporated two important computer vision
techniques to automate data cataloging for butterfly measurements. Optical Character
Recognition is used for character recognition and Contour Detection is used for imageprocessing.
Proper pre-processing is first done on the images to improve accuracy. Although
there are limitations to Tesseract’s detection of certain fonts, overall, it can successfully identify
words of basic fonts. Contour detection is an advanced technique that can be utilized to
measure an image. Shapes and mathematical calculations are crucial in determining the precise
location of the points on which to draw the body and forewing lines of the butterfly. Overall,
92% accuracy were achieved by the program for the set of butterflies measured.
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...cscpconf
Smart cities utilize Internet of Things (IoT) devices and sensors to enhance the quality of the city
services including energy, transportation, health, and much more. They generate massive
volumes of structured and unstructured data on a daily basis. Also, social networks, such as
Twitter, Facebook, and Google+, are becoming a new source of real-time information in smart
cities. Social network users are acting as social sensors. These datasets so large and complex
are difficult to manage with conventional data management tools and methods. To become
valuable, this massive amount of data, known as 'big data,' needs to be processed and
comprehended to hold the promise of supporting a broad range of urban and smart cities
functions, including among others transportation, water, and energy consumption, pollution
surveillance, and smart city governance. In this work, we investigate how social media analytics
help to analyze smart city data collected from various social media sources, such as Twitter and
Facebook, to detect various events taking place in a smart city and identify the importance of
events and concerns of citizens regarding some events. A case scenario analyses the opinions of
users concerning the traffic in three largest cities in the UAE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGEcscpconf
The anonymity of social networks makes it attractive for hate speech to mask their criminal
activities online posing a challenge to the world and in particular Ethiopia. With this everincreasing
volume of social media data, hate speech identification becomes a challenge in
aggravating conflict between citizens of nations. The high rate of production, has become
difficult to collect, store and analyze such big data using traditional detection methods. This
paper proposed the application of apache spark in hate speech detection to reduce the
challenges. Authors developed an apache spark based model to classify Amharic Facebook
posts and comments into hate and not hate. Authors employed Random forest and Naïve Bayes
for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold crossvalidation,
the model based on word2vec embedding performed best with 79.83%accuracy. The
proposed method achieve a promising result with unique feature of spark for big data.
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
This article presents Part of Speech tagging for Nepali text using General Regression Neural
Network (GRNN). The corpus is divided into two parts viz. training and testing. The network is
trained and validated on both training and testing data. It is observed that 96.13% words are
correctly being tagged on training set whereas 74.38% words are tagged correctly on testing
data set using GRNN. The result is compared with the traditional Viterbi algorithm based on
Hidden Markov Model. Viterbi algorithm yields 97.2% and 40% classification accuracies on
training and testing data sets respectively. GRNN based POS Tagger is more consistent than the
traditional Viterbi decoding technique.
Palestine last event orientationfvgnh .pptxRaedMohamed3
An EFL lesson about the current events in Palestine. It is intended to be for intermediate students who wish to increase their listening skills through a short lesson in power point.
How to Create Map Views in the Odoo 17 ERPCeline George
The map views are useful for providing a geographical representation of data. They allow users to visualize and analyze the data in a more intuitive manner.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
Ethnobotany and Ethnopharmacology:
Ethnobotany in herbal drug evaluation,
Impact of Ethnobotany in traditional medicine,
New development in herbals,
Bio-prospecting tools for drug discovery,
Role of Ethnopharmacology in drug evaluation,
Reverse Pharmacology.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
languages. This expert-defined segmentation can be based on simple characters in the two alphabets or on natural phonetic segments (based on a CV/CVC model); the mapping rules, however, have to take into account the context (the prior few and the next few segments) of a source word segment (due to pronunciation context, etc.), making the task of defining the rules quite difficult (in fact, this rule-based approach is followed in many English-to-Indic transliterators freely available today). The rule-set must be such that every possibility is properly captured in the rules, making the process cumbersome, and language and expert dependent. The other option is to learn the rules from carefully constructed training sets (which is perhaps an easier task) using machine learning approaches. These approaches aim to automate the process of discovering the rule-set from the training data; if the training set generalizes sufficiently and the approach is robust, then this approach can succeed and may even better an expert-created set of rules. Clearly, the method must be good at discovering and weighting the multiple, and perhaps context-sensitive, mappings between source and target segment sets. Where experts at rule-set creation are lacking, the statistical approach may be the only recourse for arriving at a good transliteration, making this language-agnostic approach easy to use by non-experts. Especially in a domain like web search, where search terms are used even by native speakers with multiple written variations for the same phonetic intent, all relevant variations of the source word must be discovered, matched, and ranked based on various criteria, including contextual and statistical ones, making the statistical discovery approach the more viable one. In this paper, we take this machine-learnt approach (using the maximum entropy method), with the segmentation and alignment during training based on a natural CV/CVC model, and we use the same segmentation during testing.
In contrast to the syllabic method that we follow here, learning the alignment at the individual character or grapheme level purely from data has been studied before [9]; while it does not have the highest accuracy, it is robust, language agnostic, and does not require rule experts. On the other hand, learning the alignment at our phonetic segment level has also been studied for English-Chinese [2]. Our basic question about this syllabic method was: does it help Indic languages? This question becomes more interesting because Indic languages have an "alpha-syllabary", where vowel inflections of a basic consonant sound in the source word (e.g. the English segments 'ka' and 'ki') are represented as a vowel diacritic symbol next to the basic consonant symbol. This is easily mapped into a target Unicode sequence which has the Unicode element for the 'k' sound preceded or succeeded by the Unicode element for the appropriate vowel diacritic (it helps to think of vowels in Latin and generally western scripts as performing the same diacritic-symbol role, but in a split way, reconstructed the same way cognitively; that is, 'ka' would be rendered as two graphemes in English but as a single grapheme in the target Indian language). The question for an Indic language then becomes whether learning this mapping at our larger syllable-like segment level can beat the more atomic grapheme-level mapping (note that context, that is, windows of the previous i and next j segments, plays a role in learning the mapping both in [9] and in ours, regardless of the difference in training pairs). Where the languages are not as simple and do not have alpha-syllabaries (e.g. the dialect scenario encountered in the prior English-Chinese work [2]), the authors have to use more involved phonetic segmentation to agglomerate the source symbols. This summarizes what is easy about statistically learning an English-to-Indic mapping. The harder, perhaps counter-intuitive, part is our main result: shifting the learning up from the grapheme level to a sub-syllable-like segment level does not appear to provide significant additional benefit for Indian languages, at least for similar training set sizes. Note that our approach is not based on phoneme or pronunciation models, but captures their net effect in a rank-ordered text-to-text mapping learned from an appropriately segmented training set.
By way of published history of the methods investigated in this area, we note the following. [8] experimented with a finite state transducer based model followed by the EM algorithm. [1] and [3] used the alignment model used in statistical machine translation systems. [17] proposed a transliteration method without any parallel resources. Other approaches to transliteration include using bilingual corpora, web mining, and comparing terms of different languages in a phonetic space. [21] compared and evaluated transliteration alignment and its impact on transliteration efficiency. To place our work in context: we consider a source-target language alignment for an Indian language in terms of syllable-like segments instead of individual characters. Segmenting word pairs as 'syllabic sequences' instead of characters was mentioned in [10] and [22] using rule-based approaches. [18] and [16] used monotone-search-based models and bi-directional character alignment, respectively, for a substring-based transliteration system, which is a form of segmentation. [7] used dynamic programming to learn the segmentation process. There is also another route, followed in [24], which treats transliteration as a special case of machine translation and solves it using phrase-based statistical machine translation techniques. This is in effect the closest to our approach, but our simpler method is specifically tailored for Indic languages.
Due to the lack of standardized data sets for transliteration into Indic languages, our experiments were done on an artificially created and partially validated dataset derived from a monolingual word-list, by generating multiple possible Romanized spellings for every given Telugu word in the list (this portion of the paper can be thought of as engineering to create a dataset, and is not in the main flow). We believe that the results would be the same if we had standardized data sets. We also note that we did not use any speller-based post-processing for better ranking, but used bigram and trigram weights to rank-order the candidates; we believe this did not have a significant effect on our results. Since our method extends to most Indic and other language groups, we believe similar results would apply there.
2. LEARNING A TRANSLITERATION FROM SYLLABLE-LIKE ALIGNMENT
Say we are given a sufficiently large set of training pairs (s, t), where s and t are from the source language S and target language T. The problem is to learn a general transliteration mapping from S to T such that, given an unseen word s from S, we can construct a reasonable transliteration t from T. The basic idea is to learn this by enumerating and estimating the mapping after segmenting each word pair <s, t> in the training set into disjoint concatenated segments <s = (s1, s2, ..., sk), t = (t1, t2, ..., tl)>, such that k = l, with each si aligned to ti for all 1 <= i <= k. Instead of a brute-force segmentation, which is intractable, we can use a character-level segmentation [9] or choose a segmentation of the source and target words in some consistent and natural way; one clear choice is a fixed syllable and sub-syllable based segmentation grounded in phonology and phonotactics. This is the method we adopt. The question now is whether this approach can learn the transliteration more accurately than the character-level approach, with the same or a similar training set. Intuitively the answer would seem to be yes (the reason we started this work); however, our results indicate otherwise.
Here, we learn the mapping with the maximum entropy method used in [9], which is this: given a sequence pair <s = (s1, s2, ..., sp), t = (t1, t2, ..., tq)> and the next segments <s(p+1), t(q+1)>, the joint probability of this event is estimated from how many times it occurred in the training set (subject to window length constraints); if it did not occur at all, then we assume that all such events have equal likelihood, which is the assumption that maximizes the entropy of the state (hence the name). The algorithm simply keeps track of these probabilities in a table as training proceeds. At test time, the input sequence is parsed from left to right (subject to window constraints) and the probability sequence unwound from the table; when we reach the right end, we will have computed a probability for every target sequence thus encountered. By keeping only the higher-probability target sequences (a process we term beam forming, where the beam represents the higher-probability expansions from left to right), we keep the complexity in check while not losing the top candidate target sequences. By looking at transliteration units (TUs) that are k windows to the left and right of the current TU when looking for a proper alignment, we maintain a context (note that the context is learnt, not enforced) and also build the probability beam.
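To make the table-plus-beam procedure concrete, the following minimal Python sketch shows the test-time beam expansion. The table layout (keyed by a source-context window, a target-context window, and the next source segment), the function name, and the pass-through fallback for unseen events are our illustrative assumptions, not the exact data structures of [9]:

    def beam_transliterate(segments, prob_table, window=2, beam_width=10):
        # `segments` is the syllable-like segmentation of the source word,
        # e.g. ['au', 's', 't', 'ra', 'li', 'a']; `prob_table` maps
        # (source context, target context, next source segment) to a dict
        # of candidate target segments with their learned probabilities.
        beam = [((), 1.0)]  # (target segments so far, accumulated probability)
        for i, seg in enumerate(segments):
            src_ctx = tuple(segments[max(0, i - window):i])
            expansions = []
            for tgt_seq, score in beam:
                candidates = prob_table.get((src_ctx, tgt_seq[-window:], seg))
                if not candidates:
                    # Unseen event: the maximum-entropy assumption of equal
                    # likelihood; a pass-through candidate stands in here.
                    candidates = {seg: 1.0}
                for tgt, p in candidates.items():
                    expansions.append((tgt_seq + (tgt,), score * p))
            # Keep only the higher-probability partial sequences (the "beam").
            beam = sorted(expansions, key=lambda x: -x[1])[:beam_width]
        return beam

The top of the returned beam gives the Top-1 candidate, while the full beam feeds the Top-10 re-ranking described in Section 3.3.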
We borrow from the speech domain the syllable or sub-syllable as the unit of segmentation: the simple (V, VC, CV, CVC) model of syllable formation comes to mind, but the syllable boundary must be defined consistently, both within a single language and from source to target, to make training in this context possible. We adopt a simple model (explained in Section 3) to segment the words of both languages during training. This model is designed to enforce phonetic consistency of the source and target segmentations by appealing to vowel sequences as markers (we also tried a consistent syllable/sub-syllable segmentation, but the results were not significantly different, and training sets need to be larger in that case to learn the larger and more varied segments, in addition to longer training times). The Indic language of our choice during training was Telugu; however, the methods would work identically for other Indic languages, and for many other languages.
3. OUR APPROACH TO WORD SEGMENTATION AND ALIGNMENT
For details on syllabic and phonetic structure of Indic languages and scripts, we refer the reader to
[12]. We explored a couple of approaches to word segmentation during both Training and
Testing, but here we focus on the details of only one. One can construct other methods based on
the (V, VC, CV, CVC) model also (we in fact did this), but we believe that the specific method
used here does not have a significant bearing on the conclusions we reach.
3.1 Word Segmentation during Training Phase
The English word is first broken into segments in a multi-stage process:
1. Aligning the English-Indic pair character by character
2. Segmentation of the Indic word
3. Using the segmented Indic word and the character alignment to generate an Indic-script-based segmentation of the English word.
Step 1: Character by Character Alignment of the English-Indic pair:
The English characters are aligned to Indic character sequences by making use of a mapping between the English and Indian alphabets. The English character alignment array has length equal to that of the English string, and its elements take values in (0, 1, 2) depending on how many Indic characters the English character maps to. The process of character alignment proceeds as follows:
1. Iterate through the English word character by character.
2. At each character, check the corresponding Indic character. If the English character can map to the current and next Indic characters (which can be determined from the mapping table), assign 2 to the corresponding alignment array element and increment the Indic character index. Otherwise, if the English character can map to the current Indic character, assign 1 to the corresponding alignment array element. Otherwise, if the English character does not map to the current Indic character at all, assign 0 to the corresponding alignment array element.
This gives our English character alignment array by the end of the iteration through the English string. A sample mapping for English-Telugu alignment ("#" indicates a null mapping) is shown in the accompanying figure, and a code sketch of the alignment procedure follows.
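As a minimal illustration of Step 1, the following Python sketch builds the alignment array; the shape of the mapping table (each English letter mapped to a set of one- or two-character Indic strings, with silent letters mapping to nothing) is our assumption for illustration:

    def char_alignment_array(english_word, indic_chars, mapping):
        # `indic_chars` is the Indic word as a list of Unicode characters;
        # `mapping[ch]` is the set of Indic strings the English character
        # `ch` may map to (empty or missing for a "#"/null mapping).
        align = []
        j = 0  # index into the Indic character sequence
        for ch in english_word.lower():
            targets = mapping.get(ch, set())
            if j + 1 < len(indic_chars) and indic_chars[j] + indic_chars[j + 1] in targets:
                align.append(2)  # consumes two Indic characters
                j += 2
            elif j < len(indic_chars) and indic_chars[j] in targets:
                align.append(1)  # consumes one Indic character
                j += 1
            else:
                align.append(0)  # silent or unmapped English character
        return align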
Step 2: Generating Indic Segment Array:
Here, each half vowel (dependent vowel sign) that combines with a consonant preceding it in the character sequence is treated as a single unit together with that consonant. Therefore, the word Australia splits into six segments in Telugu script (ఆ-స్-ట్-రే-లి-యా), while, splitting by character, the word has 11 characters (ఆ, స, ్, ట, ్, ర, ే, ల, ి, య, ా). The Indic segmentation array is an array of length equal to the number of segments of the Indic string (6 in this case). Each element of this array indicates the number of characters in that segment; hence, the Indic segment array here is (1 2 2 2 2 2). Just as with diphthongs in English, certain Indic languages (like Hindi) allow the juxtaposition of two full vowels, and our method can be suitably modified to take care of this.
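The following Python sketch captures this grouping rule, using Unicode combining-mark categories to detect dependent vowel signs and the virama; this is our rendering of the rule, not code from the paper:

    import unicodedata

    def indic_segment_array(indic_word):
        # Attach every combining mark (dependent vowel sign or virama) to
        # the consonant before it, then report the character count per segment.
        segments = []
        for ch in indic_word:
            if segments and unicodedata.category(ch).startswith('M'):
                segments[-1] += ch
            else:
                segments.append(ch)
        return [len(s) for s in segments]

For the Telugu rendering of Australia above, this returns [1, 2, 2, 2, 2, 2].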
Step 3: Generating English Segment Array:
The English segment array is generated by combining the English character alignment array from Step 1 and the Indic segmentation array generated previously, as follows. For the example considered, the English character array is (1 0 2 2 1 1 1 1 2) and the Indic segmentation array is (1 2 2 2 2 2). The English segment array is equal in length to the Indic segment array, and each of its elements equals the number of English character array elements that should be combined to form the corresponding Indic segmentation array element. For example, in the word pair (Australia, ఆస్ట్రేలియా), the English segmentation becomes (au-s-t-ra-li-a). Note that this is based on simple sequential addition of the English character array entries to match the number in the Indic segment array, iterating across segments; the number of iterations determines the number of English segments generated. The English segment array here is (2 1 1 2 2 1), where the first 2 is the number of elements from the English character array that sum up to the first number (that is, 1) in the Indic segment array; a code sketch of this combination step is given below. We have verified the phonetic consistency of this scheme, which arises from positioning on the vowel markers. Some sample alignments, including non-Indic-origin words and source words with silent characters, are shown in the accompanying figure.
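A Python sketch of this combination step, under the assumption that a zero-aligned (silent) English character joins the segment just completed, is:

    def english_segment_array(char_align, indic_segments):
        # Consume alignment-array entries left to right until they sum to the
        # current Indic segment's character count; the number of entries
        # consumed is the size of the corresponding English segment.
        result, i = [], 0
        for target in indic_segments:
            count = total = 0
            while i < len(char_align) and total < target:
                total += char_align[i]
                count += 1
                i += 1
            # Assumption: trailing zero-aligned (silent) characters join the
            # segment just completed.
            while i < len(char_align) and char_align[i] == 0:
                count += 1
                i += 1
            result.append(count)
        return result

With char_align = [1, 0, 2, 2, 1, 1, 1, 1, 2] and indic_segments = [1, 2, 2, 2, 2, 2], this yields [2, 1, 1, 2, 2, 1], i.e. (au-s-t-ra-li-a).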
3.2 Word Segmentation during Testing Phase
During the testing phase, the absence of a word pair makes the segmentation process slightly tricky, since the segmentation of the English word is based on that of a putative Indic word. This is done in two steps:
1. An approximate Unicode (Indic word) representation of the Roman word is obtained.
2. The word segmentation algorithm explained in the previous section is applied to the resulting word pair: the English word and its approximate Unicode representation.
Obtaining an approximate Unicode representation of the Roman word is done by maintaining an approximate one-to-one mapping between the English alphabet and the Indic alphabet. This need not cover the entire Indian alphabet, since it is only an intermediate stage. A sample mapping would be similar to the alignment table in Step 1 of Section 3.1: a one-to-one mapping between consonants, and a one-to-two mapping for vowels so that one of the two can be chosen depending on whether the vowel appears after a consonant or stands independently. Thus, for (Australia), an approximate Telugu equivalent is obtained from this mapping. Now, as in the previous section, the English alignment array is (1 0 2 2 1 1 1 1 1) (see Step 1 of the previous section) and the Indic segment array is (1 2 2 2 2 1) (see Step 2 of the previous section). Hence, the English segment array (2 1 1 2 2 1) again yields (au-s-t-ra-li-a). Thus the segmentation is performed on English words during the testing phase in the absence of parallel word pairs. Note the slight difference in the array elements between the training and testing phases, which does not lead to any loss in accuracy. Again, we have manually verified the segmentation of a large number of test words to ensure that the results of this paper are not affected by errors in this section.
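Reusing the helper sketches from Section 3.1, the testing-phase segmentation can be outlined as below; `approx_map` (our hypothetical intermediate English-to-Indic character map, with the one-to-two vowel handling simplified away) and the function name are assumptions for illustration:

    def segment_for_testing(english_word, approx_map, mapping):
        # Step 1: build the putative Indic word from the approximate map.
        putative = ''.join(approx_map.get(ch, '') for ch in english_word.lower())
        # Step 2: reuse the training-phase routines on the synthetic pair.
        align = char_alignment_array(english_word, list(putative), mapping)
        segments = indic_segment_array(putative)
        return english_segment_array(align, segments)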
3.3 Learning a Good Segment Mapping
We trained a segmentation-based transliterator using a language-agnostic learner which works as follows:
1. Take each training pair and segment and align it as described in the previous section (the datasets are explained in Section 4).
2. Run the maximum entropy method of [9], suitably modified to learn on segments instead of characters. This gives us the mapping probabilities of (English -> Indic) segments learned from the context of the training set.
3. On a test input, produce a beam as in [9], with the additional step of further re-ranking using language bigrams and trigrams. The weights assigned to the bigrams and trigrams were chosen on a trial-and-error basis; a weighted sum of (3*bigram + 2*trigram) was found to give optimal performance. A sketch of this re-ranking step follows.
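A minimal sketch of the re-ranking in step 3, assuming precomputed character bigram and trigram frequency tables for the target language (the names and the per-word averaging are ours; only the 3:2 weighting comes from the paper):

    def rerank_beam(candidates, bigram_freq, trigram_freq):
        # `candidates` is the beam output: a list of (target_word, model_score).
        def ngram_score(word, freq, n):
            grams = [word[i:i + n] for i in range(len(word) - n + 1)]
            return sum(freq.get(g, 0.0) for g in grams) / len(grams) if grams else 0.0

        def lm_score(item):
            word = item[0]
            # Weighted sum found by trial and error to work best.
            return 3 * ngram_score(word, bigram_freq, 2) + 2 * ngram_score(word, trigram_freq, 3)

        return sorted(candidates, key=lm_score, reverse=True)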
4. DATA SETS: TRAINING AND TESTING
As explained before, due to the lack of standardized training sets for Indic languages, the dataset for training was generated automatically, while the test set was obtained from a series of user experiments.
Training Set Collection: The parallel data was automatically generated by applying a set of transliteration rules (as explained in Section 3.1), created heuristically, on a monolingual word list obtained from a search engine index. These rules mapped each alphabet of the Indian-language word to its possible Roman equivalents. We obtained 200K word pairs through this process; however, a subset of this collection was chosen for training purposes.
Optimal Training Set Collection: We applied a greedy-algorithm-based approach followed in speech corpus creation ([20] and [11]) to select optimal training data. Thus, a dataset of around 80K words was chosen to train the transliterator.
Test Data: The test transliteration data was obtained through a series of data collection experiments with live users, and it consisted of general conversational text. A dataset of 20K word pairs was collected through these experiments for the language pair English-Telugu.
5. EXPERIMENTS AND RESULTS
We evaluated the segmentation-based as well as the character-alignment-based transliteration using the metrics defined by [6]. The accompanying table shows the results on the optimal training set for both the character alignment and segmentation approaches with different feature sets. Here, ACC-i stands for accuracy at position i or less, and MRR stands for mean reciprocal rank. We use a three-dimensional feature vector (a, b, c), where {a} and {b} refer to the windows of source units before and after the current transliteration unit, and {c} refers to the window of target units before the current transliteration unit. This models the local contextual component of the learner; long-range contextual dependence was not modeled. Perhaps as a result of this, the learner converged quite rapidly, indicating that the local context was being identified quickly in the learning process.
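For reference, the two metrics just named can be sketched as follows (standard definitions, simplified to a single reference transliteration per test word; the code itself is ours, not from [6]):

    def acc_at_i(results, references, i):
        # Fraction of test words whose reference transliteration appears
        # among the top i candidates.
        hits = sum(1 for cands, ref in zip(results, references) if ref in cands[:i])
        return hits / len(references)

    def mean_reciprocal_rank(results, references):
        # Average of 1/rank of the reference in each candidate list
        # (contributing 0 when the reference is absent).
        total = 0.0
        for cands, ref in zip(results, references):
            if ref in cands:
                total += 1.0 / (cands.index(ref) + 1)
        return total / len(references)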
It can be observed that our approach improved the Top-1 transliteration accuracy. However, we can also see that the original character alignment model outperformed our sub-syllable-like model when the Top-10 results of the system are re-ranked using simple and common techniques, including bigram frequencies (these experiments were repeated on the full training dataset of 200K pairs with similar numbers, which confirmed our intuition in selecting the optimal training set). A possible explanation of the above results: the Indic phonetic/grapheme/Unicode matching and Unicode concatenation form a natural training "window", making the previous simple Unicode-based character-level alignment method at least as powerful as our syllable-based method.
6. CONCLUSIONS AND FUTURE DIRECTIONS
Based on the methods here and the results from our experimentation, we did not see any gains from the syllable-like approach to transliteration. We think that the conclusions would be the same regardless of the learning method used. We also think that if we repeated our experiments on other Indic languages, the findings would not change, and would still favor the atomic, letter-level alignment, in spite of the seeming attractiveness of syllable-like segmentation and alignment. We suspect that the syllable-like method is not learning any increased contextual discrimination between a compound symbol (e.g. "ta") and its broken-down components (e.g. "t" + "a"), either because that increased contextual discrimination does not exist in the training set (or in the language, meaning that the alternative phonetic contextual sub-sequences are equally likely, yielding maximum entropy anyway), or because the method is being overwhelmed by the much larger set of compound symbols. While this is not the last word on the utility of our syllable-like transliteration method, it appears hard to beat the learning of a traditional letter-level or atomic-level mapping (which, incidentally, can also learn compound-symbol combinations when needed by varying window sizes); its most significant advantage is that it is mostly language agnostic.
ACKNOWLEDGMENTS
The authors would like to thank Dr. Kumaran for extensive help and suggestions, and Rahul Maddimsetty for collaborating on rule-based approaches.
REFERENCES
[1] Abdul Jaleel and Larkey: Statistical Transliteration for English-Arabic Cross-Language Information Retrieval, Proceedings of CIKM, 2003, p. 139-146
[2] W. Gao, K. Wong, and W. Lam: Phoneme-based Transliteration of Foreign Names for OOV Problem, Proceedings of the IJCNLP, 2004
[3] P. Virga, and S. Khudanpur: Transliteration of Proper Names in Cross Lingual Information Retrieval,
Proceedings of the ACL Workshop on Multi-lingual Named Entity Recognition, 2003
[4] Ari Pirkola, Jarmo Toivonen, Heikki Keskustalo, Kari Visala, and Kalervo Järvelin: Fuzzy
Translation of Cross-Lingual Spelling Variants, Proceedings of SIGIR, 2003
[5] D.Goldwasser and D.Roth: Active sample selection for named entity transliteration, Proc. of the
Annual Meeting of the Association of Computational Linguistics (ACL), 2008
[6] Haizhou Li, A Kumaran, Min Zhang and Vladimir Pervouchine: Whitepaper of NEWS 2009 Machine
Transliteration Shared Task, Proceedings of Named Entities Workshop, ACL 2009, p. 19-26
[7] Jeff Pasternack and Dan Roth: Learning Better Transliterations, Proceedings of CIKM, 2009
[8] Kevin Knight and Jonathan Graehl: Machine Transliteration, Computational Linguistics, 24(4), 1998
[9] Kumaran A. and Kellner T.: A generic framework for machine transliteration, Proceedings of SIGIR, 2007, p. 721-722
[10] Long Jiang, Ming Zhou, Lee-Feng Chien and Cheng Niu: Named Entity Translation with Web Mining and Transliteration, Proceedings of IJCAI, 2007
[11] Rahul Chitturi, Sebsibe H Mariam and Rohit Kumar: Rapid Methods for Optimal Text Selection,
Proceedings of RANLP
[12] Richard Sproat: A formal computational Analysis of Indic Scripts, The International Symposium on
Indic Scripts: Past and Future
[13] Richard Sproat, Tao Tao and ChengXiang Zhai: Named entity transliteration with comparable
corpora, Proceedings of ACL 2006
[14] Sandeva G., Hayashi Y., Itoh Y. and Kishino F.: Srishell Primo: A Predictive Sinhala Text Input System, Proceedings of the IJCNLP Workshop on NLP for Less Privileged Languages, 2008
[15] J. Scott McCarley: Cross-Language Name Matching, Proceedings of ACM SIGIR, 2009
[16] Sravana Reddy and Sonji Waxmonsky: Substring-based Transliteration with Conditional Random
Fields, Proceedings of Named Entities Workshop, ACL 2009, p. 96-99
[17] Sujith Ravi and Kevin Knight: Learning Phoneme Mappings for Transliteration without Parallel Data,
Proceedings of the 47th Annual Meeting of the ACL and the 5th IJCNLP
[18] Tarek Sherif and Grzegorz Kondrak: Substring-Based Transliteration, Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 2007, p. 944-951
[19] U. Hermjakob, K. Knight, and H. Daume III: Name translation in statistical machine translation - learning when to transliterate, Proc. of the Annual Meeting of ACL, 2008
[20] Van Santen J.P.H. and Buchsbaum A.L.: Methods for optimal text selection, Proceedings of Eurospeech, p. 553-556
[21] Vladimir Pervouchine, Haizhou Li and Bo Lin, Transliteration Alignment, Proceedings of the 47th
Annual Meeting of the ACL and the 5th IJCNLP, p. 136-144.
[22] Xue Jiang, Le Sun and Dakun Zhang: Syllable-based Name Transliteration System, Proceedings of
Named Entities Workshop, ACL 2009, p. 96-99
[23] Yo Ehara and Kumiko Tanaka-Ishii: Multilingual Text Entry using Automatic Language Detection,
Proceedings of IJCNLP, 2008
[24] Li Haizhou, Zhang Min, and Su Jian: A Joint Source-Channel Model for Machine Transliteration.
Proceedings of ACL, 2004