This document introduces a program for automatically analyzing Myanmar words from web documents to support the development of a Myanmar search engine. The program collects Myanmar words from over 9,000 kilobytes of web pages downloaded from various Myanmar websites. It converts the documents to a compatible Unicode encoding and extracts individual words. Frequency analysis is then performed to determine the most common words and generate a Markov chain matrix to understand word adjacency patterns in Myanmar language text.
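The frequency analysis and word-adjacency matrix described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: it assumes the text has already been converted to Unicode and segmented into words, it assumes a first-order Markov chain, and the function names and the placeholder sample are hypothetical.

```python
from collections import Counter, defaultdict


def word_frequencies(words):
    """Count how often each word occurs in the segmented text."""
    return Counter(words)


def transition_matrix(words):
    """Estimate first-order transition probabilities P(next word | current word)."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    matrix = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        matrix[current] = {w: c / total for w, c in followers.items()}
    return matrix


# Hypothetical placeholder tokens standing in for segmented Myanmar words.
sample = ["a", "b", "a", "c", "a", "b"]
freq = word_frequencies(sample)
matrix = transition_matrix(sample)
print(freq.most_common(1))
print(matrix["a"])
```

Storing the matrix as nested dictionaries keeps it sparse, which matters when the vocabulary is large and most word pairs never occur next to each other.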
Phrase identification is one of the most critical and widely studied tasks in natural language processing (NLP). Identifying verb phrases within a sentence is useful for a variety of NLP applications, and morphological analysis is one of the core enabling technologies such applications require. This paper presents a Myanmar verb phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. Machine translation needs a large amount of information to guide the translation process. Myanmar is an inflected language, and few lexicons have been created or studied for it compared with languages such as English, French, and Czech. The paper therefore proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results showed that the proposed system can improve translation quality by applying morphological analysis to Myanmar text.
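As a rough illustration of the maximum matching idea named above, the sketch below greedily takes the longest lexicon entry at each position. It is a generic forward maximum matching routine under stated assumptions, not the paper's rule set: the function name is hypothetical, and the toy Latin-script lexicon stands in for a real Myanmar lexicon.

```python
def maximum_matching(text, lexicon, max_len=None):
    """Greedy forward maximum matching: take the longest lexicon entry at each position."""
    if max_len is None:
        max_len = max(len(entry) for entry in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if candidate in lexicon or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
lexicon = {"ab", "abc", "d", "the", "them", "theme", "end"}
print(maximum_matching("abcd", lexicon))
print(maximum_matching("themend", lexicon))
```

The second call shows the known weakness of the greedy strategy: it commits to "theme" and leaves unmatched single characters, even though "them" followed by "end" would cover the input. Rule-based and statistical components are typically added to handle such cases.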
Web pages are now written in many languages, and web crawlers are essential to search engines. Language-specific crawlers traverse and collect relevant web pages by following the successive URLs found on each page. Very little research has addressed crawling for Myanmar-language web sites. Most language-specific crawlers rely on n-gram character sequences, which require training documents; the proposed crawler differs from these. The proposed system focuses only on the part of the crawler that searches for and retrieves Myanmar web pages for a Myanmar-language search engine. The crawler detects Myanmar characters and uses a rule-based syllable threshold to judge the relevance of pages. According to the experimental results, the proposed crawler has better performance, achieves good accuracy, and reduces the storage space search engines need, since it crawls only the documents relevant to Myanmar web sites.
MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATION (ijnlc)
Text classification is a very important research area in machine learning. Artificial intelligence is reshaping text classification techniques to better acquire knowledge. Despite the growth and spread of AI in text mining research for languages such as English, Japanese, and Chinese, its role with respect to Myanmar text is not yet well understood. This paper presents a comparative study of machine learning algorithms, namely Naïve Bayes (NB), k-nearest neighbours (KNN), and support vector machine (SVM), for Myanmar-language news classification; no such comparative study exists for Myanmar news. Each news item is classified into one of four categories (Political, Business, Entertainment, and Sport). The dataset consists of 12,000 documents collected from websites and belonging to these four categories. The goal of text classification is to assign documents to a fixed number of predefined categories, and the news corpus is used to train and test the classifiers. The chi-square feature selection method achieves comparable performance across several classifiers. The experimental results also show that SVM achieves better accuracy than the other classification algorithms used in this research. Because the Myanmar language is complex, it is all the more important to study and understand the nature of the data before proceeding to mining.
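For readers unfamiliar with chi-square feature selection, the sketch below computes the standard chi-square statistic for one term and one category from a 2x2 contingency table of document counts. The abstract does not give the paper's exact formulation, so this is the textbook version; the function name and the counts in the example are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or the category never varies
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical counts, not taken from the paper's dataset.
print(chi_square(30, 10, 20, 40))
print(chi_square(10, 10, 10, 10))
```

Terms are then ranked by score and the highest-scoring ones are kept as features; a score of zero means the term's presence is independent of the category.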
Myanmar named entity corpus and its use in syllable-based neural named entity... (IJECEIAES)
This document describes the development of the first manually annotated named entity corpus for the Myanmar language. It contains approximately 170,000 named entities tagged with types like person, location, organization, race, time and number. The document also discusses experiments using various deep neural network architectures for named entity recognition on Myanmar text, without additional feature engineering. Results showed that syllable-based neural models outperformed the baseline conditional random field model. This research aims to apply neural networks to Myanmar natural language processing and promote future work on this under-resourced language.
Marathi Text-To-Speech Synthesis using Natural Language Processing (iosrjce)
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE (kevig)
Manipuri is both a minority language and a morphologically rich one, with genetic features similar to those of Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order and agglutinative verb morphology, and is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a useful research field of computer science that deals with processing large natural language corpora. NLP applications include E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech (POS) Tagging, Machine Translation (MT), WordNet, Word Sense Disambiguation (WSD), etc. In this paper, we present a study of the advancements in NLP applications for the Manipuri language, together with a comparison table of the approaches and techniques adopted and the results obtained for each application, followed by a detailed discussion of each work.
Survey on Indian CLIR and MT systems in Marathi Language (Editor IJCATR)
Cross-Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query. This lets users express their information needs in their native languages. The machine translation based (MT-based) approach to CLIR uses existing machine translation techniques to translate queries automatically. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA (ijnlc)
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased enormously. It is therefore essential to extract events intelligently from it. Progress in natural language processing (NLP) technologies for information extraction makes it possible to locate and classify content in news data according to predefined categories such as person name, place name, organization name, date, and time. Named entity recognition (NER), a subtask of NLP, plays a vital role in achieving human-level performance on specific documents such as newspapers to identify entities effectively. The purpose of this research is to introduce an NER system for Bengali news data that identifies events of specified things in running text based on regular expressions and Bengali grammar. In so doing, I have designed and evaluated part-of-speech (POS) tags to recognize proper nouns. In this thesis, I explain a Hidden Markov Model (HMM) based approach to developing an NER system from Bengali news data.
This paper presents a machine translation system that translates simple assertive English sentences to Marathi sentences. The system performs morphological analysis, part-of-speech tagging, and local word grouping to convert the meaning of the English sentence to the corresponding Marathi sentence. An English to Marathi bilingual dictionary is used for translation. The system aims to help people with primary education understand English words by providing translations to their native Marathi language.
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM (ijnlc)
The document describes a machine transliteration system that transliterates Hindi and Marathi names and words to English using support vector machines (SVM). It segments source language names into phonetic units, and trains an SVM classifier using phonetic units and n-grams as features to label each unit with its English transliteration. The system achieves good accuracy for Hindi-English and Marathi-English transliteration.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
DEVELOPMENT OF PHONEME DOMINATED DATABASE FOR LIMITED DOMAIN T-T-S IN HINDI (ijaia)
Most digital information is accessible only to the relatively few people who can read or understand a particular language. The corpus is the basis for developing speech synthesis and recognition systems. In India, almost all speech research and development organizations are developing their own speech corpora for Hindi, which is the first language of more than 200 million people. The primary goal of this paper is to review the speech corpora created by various institutes and organizations so that scientists and language technologists can recognize the crucial role of corpus development in building ASR and TTS systems. The aim is to bring together all the information related to the recording, volume, and quality of speech data in these corpora to facilitate the work of researchers in speech recognition and synthesis. This paper also describes the development in our organization of a medium-size database for Metro rail passenger information systems using an HMM-based technique. The phoneme is chosen as the basic speech unit of the database. The resulting medium-size database consists of 630 utterances with 12,614 words and 11,572 phoneme tokens covering 38 phonemes, and it covers the maximum possible phonetic context.
This document discusses the development of Malay corpora in Malaysia. It provides background on what a corpus is and how the Malay Corpus was influenced by the Brown Corpus in the 1970s. It was developed by Dewan Bahasa dan Pustaka to create a database of 2 million Malay words from older and modern texts. The corpus is used for educational and research purposes like developing teaching materials and analyzing language use. It also discusses software for analyzing corpora and researchers involved in developing Malay corpora.
This document summarizes a research paper on developing a crawler to retrieve Myanmar language web pages. The proposed crawler uses rule-based syllable segmentation to determine if a web page is relevant for the Myanmar language. It detects Myanmar characters, normalizes fonts, segments text into syllables, and calculates a syllable threshold ratio to judge relevance. Experimental results showed the crawler achieved a precision of over 80% for pages with majority Myanmar content and discarded pages with less than 3% Myanmar syllables. This focused crawler extracts only relevant Myanmar pages to reduce storage space for search engines.
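The relevance check summarized above might look roughly like the following. This is a simplified sketch, not the paper's method: it counts code points in the basic Myanmar Unicode block (U+1000 to U+109F) instead of performing rule-based syllable segmentation, it skips font normalization, and it borrows the 3% cut-off from the summary as a default. All function names are hypothetical.

```python
MYANMAR_START, MYANMAR_END = 0x1000, 0x109F  # basic Myanmar Unicode block


def is_myanmar_char(ch):
    """True if the character lies in the basic Myanmar block."""
    return MYANMAR_START <= ord(ch) <= MYANMAR_END


def myanmar_ratio(text):
    """Share of non-whitespace characters that are Myanmar characters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(is_myanmar_char(ch) for ch in visible) / len(visible)


def is_relevant(text, threshold=0.03):
    """Keep a page when its Myanmar share reaches the threshold (assumed 3%)."""
    return myanmar_ratio(text) >= threshold


page = "Myanmar: \u1019\u103c\u1014\u103a\u1019\u102c"
print(myanmar_ratio(page), is_relevant(page))
```

A character-level ratio is cheaper than syllable segmentation but counts stacked consonants and vowel signs separately, so the two measures will not agree exactly on mixed-language pages.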
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ... (kevig)
Neural machine translation is a new approach to machine translation that has shown effective results for high-resource languages. Recently, attention-based neural machine translation with large-scale parallel corpora has played an important role in achieving high translation performance. In this research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural machine translation models are introduced at the word-to-word, character-to-word, and syllable-to-word levels. We run experiments with the proposed models on translating long sentences and addressing morphological problems. To mitigate the low-resource problem, source-side monolingual data are also used. This work thus investigates how to improve a Myanmar-to-English neural machine translation system. The experimental results show that the syllable-to-word level neural machine translation model improves on the baseline systems.
Arabic tweeps dialect prediction based on machine learning approach (IJECEIAES)
In this paper, we present our approach for profiling Arabic authors on Twitter, based on their tweets. We consider here the dialect of an Arabic author as an important trait to be predicted. For this purpose, many indicators, feature vectors and machine learning-based classifiers were implemented. The results of these classifiers were compared to find out the best dialect prediction model. The best dialect prediction model was obtained using random forest classifier with full forms and their stems as feature vector.
This document summarizes a presentation given in Malaysia on localizing Firefox to the Malay language. It discusses the localization process, which uses the Mercurial version control system and the Narro tool to manage translations. The presentation covers translating, reviewing, and testing content, and provides five rules for translation weekend sprint volunteers, including using approved word references and discussion channels. Frequently asked questions are also addressed, such as Narro performance issues and how those without internet access can get involved.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of Engineering and Technology.
The document describes a machine learning approach for language identification, named entity recognition, and transliteration on query words. It discusses:
1) Using supervised machine learning classifiers like random forest, decision trees, and SVMs along with contextual, character n-gram, and gazetteer features for language identification of Hindi-English and Bangla-English words.
2) Applying an IOB tagging scheme and features like character n-grams, context words, and typographic properties for named entity recognition and classification.
3) A statistical machine transliteration model that segments, aligns, and maps source and target language transliteration units based on context and probabilities learned from parallel training data.
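Two of the ingredients listed above, the IOB tagging scheme and character n-gram features, can be sketched in a few lines. This is an illustrative sketch only: the function names, the boundary markers `^` and `$`, and the example query are assumptions, not details from the summarized document.

```python
def to_iob(tokens, entities):
    """Convert entity spans (start, end, label), end exclusive, to IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def char_ngrams(word, n):
    """Character n-grams of a word, padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Hypothetical query, for illustration only.
tokens = ["flights", "to", "New", "Delhi", "today"]
print(to_iob(tokens, [(2, 4, "LOC")]))
print(char_ngrams("khana", 3))
```

The boundary markers let a classifier distinguish prefixes and suffixes from word-internal character sequences, which is often useful for telling apart romanized words from different languages.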
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts (kevig)
Cyberbullying is currently one of the most important research fields. Most researchers have worked on identifying bullying text in English texts or comments; because data are scarce, analyzing and stemming Tamil text is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, and creating a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, the authors propose the rule-based iterative preprocessing algorithm (RBIPA). In this work, Tamil morphemes and lemmas were extracted using a suffix-stripping technique, and a supervised machine learning algorithm was used to classify words with respect to pronouns and proper nouns. The novelty of the proposed system lies in a preprocessing algorithm that applies iterative stemming and lemmatization to recover exact words from Tamil-language comments. RBIPA achieves 84.96% accuracy on the given test dataset, which has a total of 13,000 words.
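Iterative suffix stripping, the general technique the abstract refers to, can be sketched as below. This is not RBIPA itself: the suffix list is a tiny, loosely romanized sample chosen for illustration, the minimum stem length is an assumed safeguard, and the function name is hypothetical.

```python
# Hypothetical, loosely romanized Tamil suffixes (plural, locative, accusative).
SUFFIXES = ["kal", "il", "ai"]


def iterative_stem(word, suffixes=SUFFIXES, min_stem=3):
    """Strip one suffix per pass and repeat until no suffix applies."""
    ordered = sorted(suffixes, key=len, reverse=True)  # prefer longer suffixes
    stripped = True
    while stripped:
        stripped = False
        for suffix in ordered:
            # Only strip when enough of the word remains to be a plausible stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
                stripped = True
                break
    return word


print(iterative_stem("veedukalil"))
```

Repeating the pass is what makes the approach suit an agglutinative language: stacked suffixes are peeled off one at a time from the outside in.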
Assisting Tool For Essay Grading For Turkish Language Instructors (Leslie Schulte)
This document describes a tool to assist Turkish language instructors in grading student essays. The tool uses natural language processing techniques to extract features from essays written in Turkish, including morphological analysis, vocabulary used, language structures, spelling errors, and more. These features are output to an Excel file to help instructors evaluate essays on several metrics, such as keyword usage, parts of speech, verb tenses, and spelling. The tool is intended to facilitate essay grading as the number of students increases. Further development is planned to incorporate machine learning to enable more automated essay grading based on data from instructors.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models (ijnlc)
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
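To make the idea of positional forms concrete, the sketch below picks the isolated, initial, medial, or final form for each letter of an Arabic word using a plain joining rule. It is a rule-based simplification, not the Hidden Markov Model approach named in the title: it assumes the input contains only basic Arabic letters, treats every letter outside a small hard-coded set as joining on both sides, and ignores diacritics, ligatures, and non-joining characters. The function name is hypothetical.

```python
# Letters that join to the preceding letter but never to the following one:
# alef, dal, thal, reh, zain, waw.
RIGHT_JOINING_ONLY = {"\u0627", "\u062f", "\u0630", "\u0631", "\u0632", "\u0648"}


def positional_forms(word):
    """Return the positional form name for each letter, in logical order."""
    forms = []
    for i, ch in enumerate(word):
        # A letter connects backwards only if the previous letter joins forward.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING_ONLY
        # A letter connects forwards only if it can and another letter follows.
        joins_next = i < len(word) - 1 and ch not in RIGHT_JOINING_ONLY
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaf, teh, alef, beh (the word "kitab")
print(positional_forms("\u0643\u062a\u0627\u0628"))
```

A renderer would then map each letter and form name to the corresponding glyph; the mapping differs between languages that share the script, which is where the language-specific rules mentioned above come in.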
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework (IRJET Journal)
This document describes a text-to-speech synthesis system for the Hindi language developed using the Festival framework. The system takes Hindi text as input and outputs synthesized speech. It uses a syllable-based concatenative approach where Hindi words are segmented into syllables which are then matched to recorded audio files and concatenated to generate speech. Challenges in developing text-to-speech for Hindi include accurate pronunciation rules and producing natural prosody. The system aims to improve the naturalness of synthesized Hindi speech output.
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ... (Syeful Islam)
Hundreds of millions of people, of almost all levels of education and attitudes and from different countries, communicate with each other for different purposes using various languages. Machine translation is in high demand because of the growing use of web-based communication. One of the major problems in Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and a huge number of different named entities exist in Bangla, so it is difficult to tell whether a word is a naming word or not. Here we introduce a new approach to identifying naming words in a Bengali sentence for a machine translation system without storing a huge number of named entities in a word dictionary. The goal is to make Bangla sentence conversion possible while storing a minimal number of words in the dictionary.
Implementation of Marathi Language Speech Databases for Large Dictionary
iosrjce
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels
Automatic text summarization of konkani texts using pre-trained word embeddin...
IJECEIAES
Automatic text summarization has gained immense popularity in research. Previously, several methods have been explored for obtaining effective text summarization outcomes. However, most of the work pertains to the most popular languages spoken in the world. Through this paper, we explore the area of extractive automatic text summarization using deep learning approach and apply it to Konkani language, which is a low-resource language as there are limited resources, such as data, tools, speakers and/or experts in Konkani. In the proposed technique, Facebook’s fastText pre-trained word embeddings are used to get a vector representation for sentences. Thereafter, deep multi-layer perceptron technique is employed, as a supervised binary classification task for auto-generating summaries using the feature vectors. Using pre-trained fastText word embeddings eliminated the requirement of a large training set and reduced training time. The system generated summaries were evaluated against the ‘gold-standard’ human generated summaries with recall-oriented understudy for gisting evaluation (ROUGE) toolkit. The results thus obtained showed that performance of the proposed system matched closely to the performance of the human annotators in generating summaries.
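The ROUGE evaluation mentioned above can be illustrated with a minimal sketch of ROUGE-1 recall, the fraction of reference unigrams that also appear in the system summary. This is not the ROUGE toolkit itself: the function name `rouge1_recall` and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram recall of a candidate summary against a reference summary.

    Assumption: tokens are lowercase, whitespace-separated words. The real
    ROUGE toolkit offers more options (stemming, stop words, other n-grams).
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clip each reference token's count by how often the candidate uses it.
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total
```

A score of 1.0 means every reference word was recovered; recall alone does not penalize overly long summaries, which is why it is usually reported alongside precision or an F-score.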
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
ijnlc
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased enormously. It is therefore essential to extract events from it intelligently. Progress in natural language processing (NLP) technologies for information extraction is used to locate and classify content in news data according to predefined categories such as person name, place name, organization name, date, time etc. Named entity recognition (NER), a subtask of NLP, plays a vital role in achieving human-level performance on specific documents such as newspapers by identifying entities effectively. The purpose of this research is to introduce an NER system for Bengali news data that identifies events of specified things in running text based on regular expressions and Bengali grammar. In so doing, I have designed and evaluated part-of-speech (POS) tags to recognize proper nouns. In this thesis, I explain a Hidden Markov Model (HMM) based approach for developing an NER system from Bengali news data.
This paper presents a machine translation system that translates simple assertive English sentences to Marathi sentences. The system performs morphological analysis, part-of-speech tagging, and local word grouping to convert the meaning of the English sentence to the corresponding Marathi sentence. An English to Marathi bilingual dictionary is used for translation. The system aims to help people with primary education understand English words by providing translations to their native Marathi language.
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
ijnlc
The document describes a machine transliteration system that transliterates Hindi and Marathi names and words to English using support vector machines (SVM). It segments source language names into phonetic units, and trains an SVM classifier using phonetic units and n-grams as features to label each unit with its English transliteration. The system achieves good accuracy for Hindi-English and Marathi-English transliteration.
International Journal of Engineering Research and Development
IJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
DEVELOPMENT OF PHONEME DOMINATED DATABASE FOR LIMITED DOMAIN T-T-S IN HINDI
ijaia
Most digital information is available only to the few people who can read or understand a particular language. The corpus is the basis for developing speech synthesis and recognition systems. In India, almost all speech research and development organizations are developing their own speech corpora for Hindi, which is the first language of more than 200 million people. The primary goal of this paper is to review the speech corpora created by various institutes and organizations so that scientists and language technologists can recognize the crucial role of corpus development in building ASR and TTS systems. The aim is to bring together all the information related to the recording, volume and quality of speech data in these corpora to facilitate the work of researchers in speech recognition and synthesis. The paper also describes the development, in our organization, of a medium-size database for Metro rail passenger information systems using an HMM-based technique. The phoneme is chosen as the basic speech unit of the database. The results show that the medium-size database consists of 630 utterances with 12,614 words and 11,572 phoneme tokens covering 38 phonemes, and that it covers the maximum possible phonetic context.
This document discusses the development of Malay corpora in Malaysia. It provides background on what a corpus is and how the Malay Corpus was influenced by the Brown Corpus in the 1970s. It was developed by Dewan Bahasa dan Pustaka to create a database of 2 million Malay words from older and modern texts. The corpus is used for educational and research purposes like developing teaching materials and analyzing language use. It also discusses software for analyzing corpora and researchers involved in developing Malay corpora.
This document summarizes a research paper on developing a crawler to retrieve Myanmar language web pages. The proposed crawler uses rule-based syllable segmentation to determine if a web page is relevant for the Myanmar language. It detects Myanmar characters, normalizes fonts, segments text into syllables, and calculates a syllable threshold ratio to judge relevance. Experimental results showed the crawler achieved a precision of over 80% for pages with majority Myanmar content and discarded pages with less than 3% Myanmar syllables. This focused crawler extracts only relevant Myanmar pages to reduce storage space for search engines.
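The relevance test described above can be sketched roughly as follows. This is a simplified illustration, not the crawler from the paper: it counts characters in the main Myanmar Unicode block (U+1000 to U+109F) instead of performing rule-based syllable segmentation, and the names `myanmar_ratio` and `is_relevant` are hypothetical. The 3% default threshold mirrors the discard figure quoted in the summary.

```python
def is_myanmar_char(ch: str) -> bool:
    """True if ch lies in the main Myanmar Unicode block (U+1000..U+109F)."""
    return "\u1000" <= ch <= "\u109f"


def myanmar_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Myanmar characters.

    Assumption: a character ratio stands in for the syllable ratio
    used by the crawler described in the summary.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_myanmar_char(ch) for ch in chars) / len(chars)


def is_relevant(text: str, threshold: float = 0.03) -> bool:
    """Keep a page only if its Myanmar share reaches the threshold."""
    return myanmar_ratio(text) >= threshold
```

A real crawler would first normalize legacy font encodings to Unicode, as the summary notes, since non-Unicode Myanmar fonts can reuse the same code points with different meanings.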
MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATION
kevig
Text classification is a very important research area in machine learning. Artificial Intelligence is reshaping text classification techniques to better acquire knowledge. In spite of the growth and spread of AI in text mining research for various languages such as English, Japanese, Chinese, etc., its role with respect to Myanmar text is not yet well understood. The aim of this paper is a comparative study of machine learning algorithms such as Naïve Bayes (NB), k-nearest neighbours (KNN) and support vector machine (SVM) for Myanmar-language news classification. No comparative study of machine learning algorithms on Myanmar news exists. The news is classified into one of four categories (Political, Business, Entertainment and Sport). The dataset consists of 12,000 documents belonging to the four categories. Well-known algorithms are applied to the Myanmar-language news dataset collected from websites. The goal of text classification is to classify documents into a certain number of pre-defined categories. The news corpus is used for training and testing the classifiers. The chi-square feature selection method achieves comparable performance across a number of classifiers. The experimental results also show that the support vector machine achieves better accuracy than the other classification algorithms employed in this research. Because the Myanmar language is complex, it is all the more important to study and understand the nature of the data before proceeding to mining.
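For readers unfamiliar with chi-square feature selection, the statistic for one term and one category can be computed from a 2x2 table of document counts. The sketch below is a generic textbook formulation, not the paper's implementation; the function name and the argument layout are assumptions.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for a category from a 2x2 count table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so the term carries no usable signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

Terms are then ranked by score per category and only the top-scoring ones are kept as features, which shrinks the vocabulary the classifiers have to handle.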
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
kevig
Neural machine translation is a new approach to machine translation that has shown effective results for high-resource languages. Recently, attention-based neural machine translation with large-scale parallel corpora has played an important role in achieving high translation performance. In this research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural machine translation models are introduced at the word-to-word, character-to-word, and syllable-to-word levels. We run experiments with the proposed models to translate long sentences and to address morphological problems. To reduce the low-resource problem, source-side monolingual data are also used. This work thus investigates how to improve a Myanmar-to-English neural machine translation system. The experimental results show that the syllable-to-word level neural machine translation model obtains an improvement over the baseline systems.
Arabic tweeps dialect prediction based on machine learning approach IJECEIAES
In this paper, we present our approach for profiling Arabic authors on Twitter, based on their tweets. We consider here the dialect of an Arabic author as an important trait to be predicted. For this purpose, many indicators, feature vectors and machine learning-based classifiers were implemented. The results of these classifiers were compared to find out the best dialect prediction model. The best dialect prediction model was obtained using random forest classifier with full forms and their stems as feature vector.
This document summarizes a presentation given in Malaysia on localizing Firefox to the Malay language. It discusses the localization process, using the Mercurial version control system and Narro tool to manage translations. The presentation covers translating, reviewing, and testing content, and provides 5 rules for translation weekendsprint volunteers, including using approved word references and discussion channels. Frequently asked questions are also addressed, such as Narro performance issues and how those without internet can get involved.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
The document describes a machine learning approach for language identification, named entity recognition, and transliteration on query words. It discusses:
1) Using supervised machine learning classifiers like random forest, decision trees, and SVMs along with contextual, character n-gram, and gazetteer features for language identification of Hindi-English and Bangla-English words.
2) Applying an IOB tagging scheme and features like character n-grams, context words, and typographic properties for named entity recognition and classification.
3) A statistical machine transliteration model that segments, aligns, and maps source and target language transliteration units based on context and probabilities learned from parallel training data.
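Character n-gram features of the kind listed above can be extracted with a few lines of code. This is a minimal sketch under stated assumptions: the function name `char_ngrams`, the `^` and `$` boundary markers, and the default n-gram range are illustrative choices, not details taken from the described system.

```python
def char_ngrams(word: str, n_min: int = 1, n_max: int = 3) -> list[str]:
    """Return all character n-grams of a word for n in [n_min, n_max].

    Assumption: '^' and '$' mark the word boundaries so that prefixes
    and suffixes become distinct features.
    """
    padded = f"^{word}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]
```

Each n-gram would typically become one dimension of a sparse feature vector fed to a classifier such as a random forest or an SVM.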
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
kevig
Cyberbullying is currently one of the most important research fields. The majority of researchers have contributed to research on bully text identification in English texts or comments. Due to the scarcity of data, Tamil text stemming is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, and creating a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, the authors propose the rule-based iterative preprocessing algorithm (RBIPA). In this attempt, Tamil morphemes and lemmas were extracted using the suffix stripping technique and a supervised machine learning algorithm that classifies words as pronouns and proper nouns. The novelty of the proposed system is a preprocessing algorithm for iterative stemming and lemmatization that discovers exact words in Tamil-language comments. RBIPA shows 84.96% accuracy on the given test dataset, which has a total of 13,000 words.
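The general idea of iterative suffix stripping can be sketched as below. This is not RBIPA: the function name, the minimum stem length, and the longest-suffix-first ordering are assumptions, and the suffix list is supplied by the caller. The example in the usage note uses English suffixes purely as stand-ins, since a real Tamil suffix table is not given in the abstract.

```python
def iterative_stem(word: str, suffixes: list[str], min_stem_len: int = 2) -> str:
    """Strip known suffixes repeatedly until none applies.

    Assumptions: longer suffixes are tried first, and a suffix is removed
    only if at least min_stem_len characters of stem remain.
    """
    ordered = sorted((s for s in suffixes if s), key=len, reverse=True)
    stripped = True
    while stripped:
        stripped = False
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                stripped = True
                break  # restart from the longest suffix on the shortened word
    return word
```

For instance, `iterative_stem("carelessness", ["ness", "less"])` removes "ness" and then "less" in two passes. Agglutinative languages stack several suffixes on one root, which is why a single stripping pass is usually not enough.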
Assisting Tool For Essay Grading For Turkish Language Instructors
Leslie Schulte
This document describes a tool to assist Turkish language instructors in grading student essays. The tool uses natural language processing techniques to extract features from essays written in Turkish, including morphological analysis, vocabulary used, language structures, spelling errors, and more. These features are output to an Excel file to help instructors evaluate essays on several metrics, such as keyword usage, parts of speech, verb tenses, and spelling. The tool is intended to facilitate essay grading as the number of students increases. Further development is planned to incorporate machine learning to enable more automated essay grading based on data from instructors.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
ijnlc
This article introduces a methodology for analyzing sentiment in Arabic text using a global foreign lexical source. Our method leverages a resource available in another language, such as the English SentiWordNet, for Arabic, a language with limited resources. The knowledge taken from the external resource is injected into the feature model while the machine-learning-based classifier is trained. The first step of our method is to build the bag-of-words (BOW) model of the Arabic text. The second step calculates the polarity score using machine translation and the English SentiWordNet. The scores for each text are added to the model in three pairs for objective, positive, and negative. The last step involves training the ML classifier on that model to predict the sentiment of the Arabic text. Our method improves performance over the BOW baseline model in most cases. In addition, it seems a viable approach to sentiment analysis of Arabic text, where available resources are limited.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
kevig
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonogramic characters, which is used for texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
In this talk Chengqing presents some work on development of statistical machine translation (MT) system based on the open source toolkit Moses at CASIA. In recent years, CASIA have developed several MT systems, including Chinese-to-English and English-to-Chinese, Japanese-to-Chinese, Arabic-to-Chinese, Uigur-to-Chinese and Tibetan-to-Chinese MT systems etc. Moses is a basic translation engine in our systems. Chengqing shows audience how CASIA use and extend Moses to develop the multilingual MT systems.
This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit.
MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Presentation of the OECD Artificial Intelligence Review of Germany
Statistical Analysis of Myanmar Words on the World Wide Web for Search Engine Development
Pann Yu Mon (s065402@ics.nagaokaut.ac.jp)
Maung Maung Thant (mmthant@gmail.com)
Ohnmar Htun Pe (ohnmar.iuj@gmail.com)
San Ko Oo (sankooo@gmail.com)
Yoshiki Mikami (mikami@kjs.nagaokaut.ac.jp)
†Management and Information Systems Engineering Department
Nagaoka University of Technology
††International University of Japan
Abstract
This paper introduces an automatic Myanmar word analysis program developed as part of ongoing research on Myanmar search engine development. In this research, we collected Myanmar words from documents on the World Wide Web to find out which words are used frequently. The program is designed for encodings compatible with the Unicode 5.1 standard. It can automatically generate a Markov chain matrix from the resulting words. The program was written in PHP. The Myanmar head words included in the Myanmar-English dictionary are used as index words.
Keywords
Myanmar, Code conversion tools, Myanmar word searching
1. Introduction
Myanmar language, a member of the Tibeto-Burman subfamily of the Sino-Tibetan family of languages, is spoken as a mother tongue by more than 37 million Burmese and as a second language by about 20 million members of ethnic minorities in Myanmar. It is the only official language of Myanmar, which was formerly known as Burma. Myanmar language is written in a script of circular and semi-circular letters adopted from the Mon script. The Mon script in turn is derived from the Indian Brahmi script, which flourished in the Indian subcontinent between the 5th century B.C. and the 3rd century A.D. Myanmar language has 33 consonants and 12 vowels according to traditional grammar.
Since 1990, Myanmar natural language processing tasks have been carried out by the Myanmar Unicode & NLP Research Center. The first Myanmar Unicode font for a GUI environment (Mac) was developed in 1988, and the one for Windows systems was developed in 1992. In 1998, Myanmar language processing was first discussed at ISO/IEC JTC1 and the Unicode Technical Committee, and the Myanmar character code set was finally included in ISO 10646. The Center continues to work on Myanmar language processing tasks so that they cope well with all applications; covering the whole area still requires more effort.
In this research, a program that automatically collects Myanmar words from Myanmar Web pages is proposed. The main purpose of this research is to present an analysis of Myanmar words on Myanmar Web pages in support of Myanmar search engine development. Establishing a Myanmar search engine requires many components, such as indexing rules, a sorting algorithm, a stemming algorithm, a word breaking algorithm and so on.
In this study, we have collected Myanmar Web pages from various Web sites
including Myanmar daily newspapers, community Web sites and news Web sites, amounting to 9,274 Kbytes in total. We then extracted words from the downloaded Myanmar Web pages. The detailed process of collecting words and the analysis of the resulting data are discussed in the following sections.
2. Related Research
A number of researchers, both local and worldwide, have collected Myanmar words from different sources for their individual purposes.
Since 2007, the Myanmar Unicode and NLP Research Center has been developing the Myanmar National Corpus (MNC) [5]. MNC covers all kinds of text, both written and spoken, from various resources. That project is almost finished.
Hla Hla Htay and colleagues [2] have developed Myanmar corpora based on various resources such as text from official newspapers in Myanmar, over 300 full books, and Myanmar texts from various Web sites including news sites and on-line magazines. In their research, all tasks were processed in ASCII format.
3. Methodology
Figure. 1. Step by step Procedure of Analysis
The step-by-step process of our analysis is shown in Figure 1. First, Myanmar Web pages are collected regardless of their fonts and encodings. They are then passed to a multi-font converter that converts them to Unicode 5.1. Finally, the program searches for words in the input text, and the resulting words are saved in the database. Each step is explained in more detail in the following sections.
3.1. First Step: Downloading Myanmar Web Pages
The World Wide Web is the most convenient existing source of linguistic data, providing users with an abundance of texts of various types in a large number of languages. Being already in electronic form, these texts are well suited to corpus studies.
Downloading Myanmar Web pages requires an efficient crawler that can selectively collect only Myanmar Web pages from the World Wide Web. In this research, the Language Specific Crawler (LSC) developed by one of the authors [3] was used. LSC runs concurrently with a language identifier and collects Myanmar Web pages efficiently. Table 1 lists the sources of the downloaded Web sites. After downloading, the pages were passed to the converter.
Table 1. Detail Information for source data
3.2. Second Step: Conversion of various encodings to Unicode 5.1 Standard
Myanmar texts on the Web use various encodings that are not fully compliant with Unicode 5.1, so the crawled Web pages must be converted to Unicode encoding. If the Web pages are already encoded in Unicode, the work becomes easier.
Converting various Myanmar encodings to Unicode requires an efficient converter. Currently, there are a number of Myanmar font conversion tools available on the
Web. In this research, the Kanaung converter (footnote 1) and the Burglish converter (footnote 2) were used. Although both work well, their output still needs some editing. For example, the Kanaung converter could not convert ‘ ’ and ‘ ’ correctly. The Burglish converter works correctly when converting from the “Zawgyi-One” font to the “Myanmar3” font, but when converting from the “Wininwa” font to the “Myanmar3” font it cannot convert ‘ ’ accurately, and it does not handle punctuation marks and quotation marks correctly. Manual correction is therefore needed in those cases, even though the converters are otherwise nearly accurate.
3.3. Third Step: Word Searching Algorithm
Myanmar language is written in a syllabic system, and spaces are not always placed between words or sentences. For this reason, a word segmenting algorithm and a word searching algorithm for the Myanmar language are needed. Very little research, using different approaches, has been published on segmenting sentences into words in the Myanmar language [1] [4].
In our program, all of the Myanmar head words included in the Myanmar-English Dictionary (footnote 3) are used as the index file. It includes 28,000 Myanmar words. These head words are stored in the database and sorted in reverse order of syllable length for comparison with the input data. If the input word matches one of the head words, the program retrieves that word. If the input word does not match any entry in the head word list, the program cannot retrieve the word correctly. The accuracy of this algorithm therefore depends largely on the head word list.
In our algorithm, the longest matching algorithm was used to find words in the input data. It starts at the first character of a text and, using a head word list, attempts to find the longest matching word in the list. If such a word is found, the longest-matching algorithm marks a boundary at the end of that word and then repeats the same process, searching for the longest match starting at the characters that follow the match. If no match is found in the word list, the character is simply segmented as a word.
3.4. Fourth Step: Frequency Markov Chain Analysis
In the program, word-based Markov models are also used to calculate a word matrix table that shows word adjacency in sentences (that is, which word most frequently appears after a given word). This gives us high-level background information for word boundary detection when parsing the Myanmar language. The program first finds the words in the given Web pages and calculates the frequency of each word, that is, how many times it appears on the Web sites. After that, the Markov chain matrix table is generated automatically.
4. Result
We downloaded various Web sites, including newspaper sites, blog sites, entertainment sites and sport sites, and collected 9,274 Kbytes of text data. After running the program, a total of 766,892 words were collected and 12,211 unique head words were found.
4.1. Distribution of Words on input string
Mono-syllable words are found to be the most frequently used, because they can be used in several ways. For example, the mono-syllable “ ” was found more than 20,000 times. It can be used as a polite prefix to a young man’s name (as in “ ”), as a postpositional marker indicating the objective case (as in “ ”), as an emphatic particle suffixed to words (as in “ ”), and as a postpositional marker indicating destination (as in “ ”). Bi-syllable words are the second most frequent, followed by tri-syllable words and so on. The top ten words sorted by frequency for mono-syllable, bi-syllable, tri-syllable and tetra-syllable words are shown in the following tables.
Footnotes
1 http://code.google.com/p/kanaung/
2 http://burglish.googlepages.com/fontconv.htm
3 Myanmar-English dictionary produced by Department of the Myanmar Language Commission
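The longest-matching search described in Section 3.3 can be sketched as follows. This is a minimal illustration in Python, not the authors' PHP program. The function name `segment_longest_match` is hypothetical, the three head words stand in for the 28,000-entry list, their Myanmar spellings are assumed renderings of the romanized forms [Kyun taw], [Ka lay] and [ko], and matching is done per Unicode code point rather than per syllable.

```python
def segment_longest_match(text, head_words):
    """Greedy left-to-right longest matching against a head word list."""
    max_len = max((len(w) for w in head_words), default=1)
    segments = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a head word matches.
        limit = min(max_len, len(text) - pos)
        for length in range(limit, 0, -1):
            candidate = text[pos:pos + length]
            if candidate in head_words:
                break
        else:
            # No head word starts here: emit the single character as a word.
            candidate = text[pos]
        segments.append(candidate)
        pos += len(candidate)
    return segments


# Tiny stand-in head word list (assumed spellings, for illustration only).
KYUN_TAW = "ကျွန်တော်"  # [Kyun taw], "I (male)"
KA_LAY = "ကလေး"  # [Ka lay], "child"
KO = "ကို"  # [ko], objective-case marker
head_words = {KYUN_TAW, KA_LAY, KO}

# Input without spaces between words, as is common in Myanmar text.
text = KYUN_TAW + KA_LAY + KO
print(segment_longest_match(text, head_words))
```

Because the search always prefers the longest head word, the result depends entirely on the head word list: a word missing from the list falls back to single-character segments, which is consistent with the paper's remark that accuracy depends largely on the list.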
Table 2. Top ten mono-syllable words
Mono-Syllable   Gloss                                                     Frequency
[ko]            Postpositional marker to indicate objective case          20070 (2.61%)
[ma]            Particle prefixed to a verb to form the negative sense    18181 (2.40%)
[ka]            Postpositional marker to indicate nominative case         17469 (2.30%)
[tal]           Colloquial form of the sentence final                     14424 (1.90%)
[par]           Particle denoting inclusion                               12774 (1.70%)
Table 3. Top ten bi-syllable words
Bi-Syllable     Gloss         Frequency
[Kyun taw]      I (male)      3537 (0.46%)
[Kyun ma]       I (female)    3332 (0.43%)
[Ka lay]        Child         1994 (0.26%)
[A twat]        For           1981 (0.25%)
[Ae di]         That          1737 (0.22%)
Table 4. Top ten tri-syllable words
Tri-Syllable       Gloss               Frequency
[Tha yot saung]    Actor               627 (0.08%)
[Pa ri thet]       Audience            500 (0.06%)
[Sa yar ma]        Teacher (female)    495 (0.06%)
[Thu nge chin]     Friend              404 (0.05%)
[Main ka lay]      Girl                400 (0.05%)
Table 5. Top ten tetra-syllable words
Tetra-Syllable      Gloss          Frequency
[sar yay sa yar]    Author         222 (0.02%)
[a nu pa nyar]      Art            204 (0.02%)
[a chay a nay]      Condition      176 (0.02%)
[a yay a tar]       Writing        157 (0.01%)
[a mhat ta ya]      Remembrance    138 (0.01%)
[Figure 2: bar chart. x-axis: Number of Syllables; y-axis: number of collected words. Values: mono-syllable 581,355; bi-syllable 147,100; tri-syllable 27,770; 4-syllable 9,752; 5-syllable 758; 6-syllable 117; 7-syllable 16; 8-syllable 5; 9-syllable 17; 10-syllable 2.]
Figure. 2. Number of Syllables found in Test Data
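As a quick check of the distribution reported in Figure 2, the short Python sketch below sums the per-syllable counts and converts them to shares of all collected words. It assumes the ten values in the figure map to 1 through 10 syllables in left-to-right order; the variable names are illustrative only.

```python
# Word counts per syllable length, read from Figure 2 in left-to-right order
# (the assignment of values to bars is an assumption).
syllable_counts = {
    1: 581_355, 2: 147_100, 3: 27_770, 4: 9_752, 5: 758,
    6: 117, 7: 16, 8: 5, 9: 17, 10: 2,
}

total_words = sum(syllable_counts.values())

# Share of each syllable length as a percentage of all collected words.
shares = {n: 100.0 * count / total_words for n, count in syllable_counts.items()}

for n, count in syllable_counts.items():
    print(f"{n:>2}-syllable: {count:>7,} words ({shares[n]:.2f}%)")
print(f"total: {total_words:,}")
```

Under this reading the ten counts add up to 766,892, the total number of words reported in Section 4, and mono-syllable words account for roughly three quarters of all collected words.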
4.2. Word Level Frequency Matrix
Based on the input string, the program generated a word-level Markov table. Using this matrix, we can identify adjacent word pairs. It gives us high-level background information for parsing a sentence into words. By applying this algorithm at the character level, we can also generate a character-level Markov table, which can be used in a Myanmar character input method for mobile phones.
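Table 6. Word-Level Matrix
Sum of Frequency (rows: First Word; columns: Second Word; last value in each row: Grand Total)
First Word     Second Word frequencies                                Grand Total
“ ”            1144                                                   1144
“ ”            722  1273  1217                                        4893
“ ”            1564                                                   2343
“ ”            1339  1511                                             2850
“ ”            934                                                    934
“ ”            1205  1717                                             2922
“ ”            809                                                    1754
Grand Total    1205  722  2651  1339  1273  2373  1511  1144  1217    16840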
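A word-level matrix of this kind can be built by counting adjacent word pairs in the segmented text. The Python sketch below is a minimal illustration under stated assumptions, not the authors' PHP implementation: the function names are hypothetical, and the three toy sentences are invented, using romanized forms from Tables 2 and 3 in place of Myanmar script.

```python
from collections import Counter, defaultdict


def word_level_matrix(sentences):
    """Count how often each second word directly follows each first word."""
    matrix = defaultdict(Counter)
    for words in sentences:
        for first, second in zip(words, words[1:]):
            matrix[first][second] += 1
    return matrix


def transition_probabilities(matrix):
    """Normalise each row by its row total to get first-order Markov estimates."""
    probabilities = {}
    for first, row in matrix.items():
        row_total = sum(row.values())
        probabilities[first] = {
            second: count / row_total for second, count in row.items()
        }
    return probabilities


# Hypothetical segmented sentences (invented data, romanized for readability).
sentences = [
    ["kyun taw", "ka", "ka lay", "ko"],
    ["kyun ma", "ka", "ka lay", "ko"],
    ["ka lay", "ka"],
]

matrix = word_level_matrix(sentences)
probabilities = transition_probabilities(matrix)

for first, row in matrix.items():
    print(first, dict(row), "row total:", sum(row.values()))

# The word that most frequently appears after "ka lay" in the toy data.
print(matrix["ka lay"].most_common(1))
```

Each row total plays the role of the Grand Total column in Table 6. Dividing a cell by its row total gives an estimate of how likely the second word is to follow the first, which is the adjacency information the paper uses as background for word boundary detection.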
4.3. Distribution of characters on Input String
We analyzed the character-level frequency of the input data. The result is shown in Figure 3. More than 90,000 words begin with the character “ ”, which makes it the top-ranking character. It is followed by “ ” and so on. No words were found that start with the characters “ ”. Such words could not be found even in the Myanmar-English dictionary.
[Figure 3: bar chart. x-axis: List of Characters; y-axis: number of collected words, from 0 to 100,000.]
Figure. 3. Total Frequency of Myanmar Characters found in Test Data
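The character-level counts behind an analysis like Figure 3 could be gathered along the following lines. This Python sketch is an assumption-laden illustration, not the authors' code: it considers only the basic Unicode Myanmar block (U+1000 to U+109F), skips Myanmar digits (U+1040 to U+1049) and the punctuation and signs at U+104A to U+104F, and ignores everything outside the block, such as English text. The function names are hypothetical.

```python
from collections import Counter

MYANMAR_BLOCK = range(0x1000, 0x10A0)   # basic Myanmar block, U+1000..U+109F
MYANMAR_DIGITS = range(0x1040, 0x104A)  # Myanmar digits zero to nine
MYANMAR_PUNCT = range(0x104A, 0x1050)   # section marks and related signs


def is_counted(ch):
    """True for Myanmar-block characters that are not digits or punctuation."""
    cp = ord(ch)
    return (
        cp in MYANMAR_BLOCK
        and cp not in MYANMAR_DIGITS
        and cp not in MYANMAR_PUNCT
    )


def character_frequency(text):
    """Frequency of every counted character in the text."""
    return Counter(ch for ch in text if is_counted(ch))


def initial_character_frequency(words):
    """Frequency of the first character of each segmented word."""
    return Counter(w[0] for w in words if w and is_counted(w[0]))


# Illustrative input: one Myanmar word, a Myanmar digit, English text,
# a section mark, and a lone consonant.
sample = "\u1000\u102D\u102F \u1041 abc \u104B \u1000"
print(character_frequency(sample))

words = ["\u1000\u102D\u102F", "\u1019", "\u1000\u101C\u1031\u1038", "abc"]
print(initial_character_frequency(words))
```

Counting first characters of segmented words corresponds to the statement about which character most words begin with, while `character_frequency` corresponds to an overall character count that excludes punctuation marks, numerals and English words.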
5. Error Analysis
In our test data of 9,274 Kbytes, we found 2,935,233 characters, excluding punctuation marks, numerals and English words. In terms of words, we identified a total of 766,892 Myanmar words (12,211 unique headwords), but 5,861 words (0.76%) were not identified. The errors result from incorrect spelling in the original text, undefined headwords (proper nouns that are not defined in the dictionary) and incorrect descriptions of syllable length in the database. Moreover, some errors result from words ending with certain characters such as “ ” (Myanmar Sign Dot Below) and from ambiguity in word segmentation. Some examples of errors are listed in Table 7.
Table 7. Some Examples of errors
6. Conclusion
In this paper, we presented a word segmentation program for Myanmar text based on the longest string matching algorithm and a dictionary. We also presented both word-level and character-level frequency distributions and the word-level Markov table generated by this program. The program performed the segmentation work well and proved usable as a practical word segmentation engine for various NLP applications, including a Myanmar search engine (in particular, a word stemming engine). The statistical data generated by this program are useful as background information for designing various Myanmar NLP applications, including input systems. As future work, we plan to extend our program by collecting all possible Myanmar words, including not only conversational words but also proper nouns. We expect this ongoing research will yield benefits for our Myanmar search engine development task.
Acknowledgements
We acknowledge and highly appreciate the kind assistance and help given by the Myanmar Unicode & NLP Research Center. We would like to express our thanks to Dr. Daw Myint Myint Than and U Ngwe Tun, who kindly provided us with the data we needed.
References
[1] Hla Hla Htay et al., “Myanmar Word Segmentation using Syllable level Longest Matching”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.
[2] Hla Hla Htay, G. Bharadwaja Kumar and Kavi N. Murthy, “Constructing English-Myanmar Parallel Corpora”, The Fourth International Conference on Computer Application, 2006.
[3] Pann Yu Mon, Chew Yew Choong and Yoshiki Mikami, “Language Specific Crawler for Myanmar Pages”, Proceedings of the 11th International Conference on Humans and Computers (HC 2008), Nagaoka, Japan, November 2008.
[4] Tun Thura Thet et al., “Word Segmentation of the Myanmar Language”, Journal of Information Science, Vol. 34, No. 5, pp. 688-704, 2008.
[5] Wunna Ko Ko and Thin Zar Phyo, “Selection of XML tag set for Myanmar National Corpus”, Proceedings of the 6th Workshop on Asian Language Resources (ALR6), Hyderabad, India, January 2008.