The document describes a prototype system that provides information about common abbreviations to users through text messaging on low-cost mobile phones without internet access. The system receives an acronym query in Roman script from the user, scrapes a brief definition from the English Wikipedia page, translates it to the user's native language, and sends the response via text message transliterated in Roman script. It aims to help semi-literate users who may lack English knowledge or technological skills access information about abbreviations they encounter.
Language Identifier for Languages of Pakistan Including Arabic and Persian (Waqas Tariq)
A language recognizer/identifier/guesser is a basic application for identifying the language of a text document. It takes a file as input and, after processing its text, determines the document's language using three methods, LIJ-I, LIJ-II and LIJ-III. LIJ-I alone yields poor accuracy; it is strengthened by LIJ-II and boosted to a higher level of accuracy by LIJ-III. The system also computes digram probabilities and average accuracy percentages. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences between character sets. A Java-based language recognizer is developed and presented in detail in this paper.
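To make the digram-probability step concrete, here is a minimal sketch of a character-digram language guesser; the training sentences and language labels are placeholders, not the paper's LIJ data or character-set logic.

```python
# Minimal character-digram language guesser, in the spirit of the
# digram-probability step described above. Training texts and language
# names below are illustrative placeholders.
from collections import Counter
import math

def digram_profile(text):
    """Relative frequency of each adjacent character pair."""
    pairs = Counter(zip(text, text[1:]))
    total = sum(pairs.values()) or 1
    return {p: c / total for p, c in pairs.items()}

def score(text, profile, floor=1e-6):
    """Log-likelihood of the text's digrams under a language profile."""
    return sum(math.log(profile.get(p, floor)) for p in zip(text, text[1:]))

# Hypothetical training samples; real profiles would come from corpora.
profiles = {
    "english": digram_profile("the quick brown fox jumps over the lazy dog"),
    "urdu_roman": digram_profile("aap kaise hain mein theek hoon shukriya"),
}

query = "how are you doing today"
# Picks whichever profile assigns the query's digrams higher likelihood.
print(max(profiles, key=lambda lang: score(query, profiles[lang])))
```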
ATAR: Attention-based LSTM for Arabizi transliteration (IJECEIAES)
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus, focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields an accuracy of 79% and a BLEU score of 88.49.
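For contrast with the learned model, the sketch below shows the kind of naive character-mapping baseline that attention-based transliteration improves on. This is not ATAR; the partial mapping table is illustrative only.

```python
# A naive one-to-one Arabizi-to-Arabic character mapping. This is NOT the
# ATAR model, only a rule-based baseline; the table is illustrative and
# partial, and context-dependent ambiguity is exactly what a seq2seq model
# with attention learns to resolve.
ARABIZI_MAP = {
    "2": "ء", "3": "ع", "5": "خ", "7": "ح", "9": "ق",
    "a": "ا", "b": "ب", "t": "ت", "s": "س", "m": "م",
    "n": "ن", "l": "ل", "k": "ك", "r": "ر", "h": "ه",
    "w": "و", "y": "ي", "d": "د", "f": "ف", "j": "ج",
}

def naive_transliterate(word):
    # Character-by-character lookup, leaving unknown characters unchanged.
    return "".join(ARABIZI_MAP.get(ch, ch) for ch in word.lower())

print(naive_transliterate("mar7aba"))  # imperfect, letter-by-letter output
```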
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ... (ijaia)
With the growing volume of electronic documents used in many applications, a fast and accurate text classification method is very important. Arabic text classification is one of the most challenging topics, probably because Arabic words vary widely in meaning, in addition to problems specific to the Arabic language. Many studies have shown that the Naive Bayes (NB) classifier is relatively robust, easy to implement, fast, and accurate in many fields, including text classification. However, non-linear classification problems and strong violations of the independence assumption can lead to very poor NB performance. In this paper, we first preprocess the Arabic documents to tokenize only the Arabic words. Second, we convert those words into vectors using the term frequency-inverse document frequency (TF-IDF) technique. Third, we propose an efficient approach based on a Kernel Naive Bayes (KNB) classifier to address the non-linearity of Arabic text classification. Finally, experimental results and a performance evaluation on our collected Arabic topic-mining corpus are presented, showing the effectiveness of the proposed KNB classifier against other baseline classifiers.
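The first two steps of this pipeline (Arabic-only tokenization and TF-IDF vectorization) can be sketched with scikit-learn as below; MultinomialNB stands in for the paper's Kernel Naive Bayes, and the Arabic-letter regex and toy documents are assumptions.

```python
# Sketch of the tokenization and TF-IDF steps described above, with
# scikit-learn's MultinomialNB standing in for the paper's Kernel Naive
# Bayes. The Arabic-letter token pattern and toy data are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["الكرة القدم مباراة", "البرمجة الحاسوب لغة", "مباراة كرة سلة"]
labels = ["sports", "tech", "sports"]

clf = make_pipeline(
    TfidfVectorizer(token_pattern=r"[\u0600-\u06FF]+"),  # Arabic words only
    MultinomialNB(),
)
clf.fit(docs, labels)
print(clf.predict(["مباراة كرة"]))  # expected: ['sports']
```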
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES (cscpconf)
This document presents a technique for identifying scripts (Tamil, English, Hindi) and numerals from multilingual document images using a rule-based classifier. Words are segmented and the first character of each word is represented as a 9-bit vector based on features like density, shape, and transitions. A rule-based classifier containing rules derived from training data is used to classify the script of each character. The technique aims to automatically categorize multilingual documents before applying optical character recognition and requires minimal preprocessing with high accuracy.
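The mechanism (a 9-bit feature vector looked up in a rule table) can be sketched as follows; the three implemented features and the rule table here are hypothetical stand-ins, since the summary does not list the paper's actual nine features or learned rules.

```python
import numpy as np

# Hypothetical stand-ins for the paper's nine binary features; the real
# features (density, shape, transition counts) are computed from the
# character image, and the rule table is derived from training data.
def nine_bit_vector(glyph):
    """glyph: 2-D binary numpy array for the first character of a word."""
    h, w = glyph.shape
    density = glyph.mean() > 0.3                       # ink density
    top_heavy = glyph[: h // 2].sum() > glyph[h // 2:].sum()
    wide = w > h
    # ... six more features would follow in the real system
    bits = [density, top_heavy, wide] + [False] * 6
    return tuple(int(b) for b in bits)

RULES = {  # illustrative rule table: bit pattern -> script label
    (1, 1, 0, 0, 0, 0, 0, 0, 0): "Tamil",
    (1, 0, 1, 0, 0, 0, 0, 0, 0): "English",
    (0, 1, 0, 0, 0, 0, 0, 0, 0): "Hindi",
}

glyph = np.ones((6, 10))            # dummy all-ink glyph, wider than tall
print(RULES.get(nine_bit_vector(glyph), "unknown"))  # -> English
```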
The document discusses techniques for classifying and filtering email messages. It describes using a Naive Bayes classifier with word frequency statistics to efficiently classify messages into folders. The key is leveraging each user's personal filtering preferences through supervised training on their labeled messages. The system was tested on over 7,000 messages across four users, achieving 89% average accuracy at a filtering speed of 259 messages per second.
This document describes research on developing a Bangla programming language and compiler. The researchers created a Bangla programming language with syntax similar to BASIC to make it easier for beginners, and developed a compiler that translates the Bangla source code into intermediate code. They implemented the system in Java and tested it on sample code files, achieving accuracy rates of 33% to 100% on translating keywords and data types. The goal was to introduce programming to students in their native Bangla language.
Development and testing of an FPT.AI-based voicebot (journalBEEI)
In recent years, voicebots have become a popular communication tool between humans and machines. In this paper, we introduce our voicebot, which integrates text-to-speech (TTS) and speech-to-text (STT) modules provided by FPT.AI. This voicebot can be considered a significant improvement over a typical chatbot because it can respond to users' queries in both text and speech. FPT Open Speech, the LibriSpeech dataset, and music files were used to test the accuracy and performance of the STT module. The TTS module was tested using text from news pages in both Vietnamese and English. To test the voicebot, Homestay Service topic questions and off-topic messages were input to the system. The TTS module achieved 100% accuracy on the Vietnamese text test and 72.66% accuracy on the English text test. In the STT module test, the accuracy for the FPT Open Speech dataset (Vietnamese) is 90.51% and for the LibriSpeech dataset (English) is 0%, while the accuracy on music files is 0% for both. The voicebot achieved 100% accuracy in its test. Since the FPT.AI STT and TTS modules were developed to support only Vietnamese, targeting the Vietnamese market, it is unsurprising that the LibriSpeech test resulted in 0% accuracy.
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB... (ijnlc)
In this paper we present our study of the sublanguage of Arabic SMS-based classified ads. The study is presented from the developer's point of view, using a corpus collected from an operational system, CATS. We also compare the SMS-based and Web-based messages and discuss some quantitative properties of the studied text.
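The summary names quantitative properties without listing them; as an illustration, here is a minimal sketch computing two standard sublanguage statistics (type/token ratio and mean message length) over toy samples, which are placeholders for the CATS data.

```python
# Two common sublanguage statistics over toy SMS and Web samples; the
# messages below are placeholders for the CATS corpus, and the paper's
# actual measures may differ.
def corpus_stats(messages):
    tokens = [t for m in messages for t in m.split()]
    return {
        "messages": len(messages),
        "tokens": len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_msg_len": len(tokens) / len(messages),
    }

sms = ["car for sale 2005 good price", "flat 3 rooms amman cheap"]
web = ["a 2005 sedan in excellent condition is offered for sale"]
print(corpus_stats(sms))
print(corpus_stats(web))
```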
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
This document discusses font and size identification in Telugu printed documents. It provides background on the Telugu script, which contains a large number of compound characters formed from combinations of vowels and consonants. The document then discusses the need for font and size identification as a preprocessing step for optical character recognition (OCR) systems to improve accuracy. It presents an approach using zonal analysis and connected component analysis to extract features from text images like aspect ratio and pixel ratio to identify the font and size by comparing to a database. Results showed this approach could accurately identify different fonts and sizes in Telugu text images.
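As a rough illustration of the two named features (aspect ratio and pixel ratio) computed from a binary character image, here is a sketch; the stored font database values and the nearest-neighbour matching are assumptions, not the paper's exact procedure.

```python
import numpy as np

# Aspect ratio and pixel ratio for one binary glyph image, matched against
# stored (font, size) feature vectors. Database values are illustrative.
def features(glyph):
    ys, xs = np.nonzero(glyph)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    aspect_ratio = w / h
    pixel_ratio = glyph.sum() / (h * w)   # ink pixels / bounding-box area
    return np.array([aspect_ratio, pixel_ratio])

FONT_DB = {  # hypothetical stored feature vectors per (font, size)
    ("Pothana", 12): np.array([0.80, 0.95]),
    ("Pothana", 16): np.array([0.95, 0.60]),
    ("Vemana", 12): np.array([1.10, 0.50]),
}

glyph = np.zeros((16, 14)); glyph[2:14, 3:12] = 1   # dummy glyph
f = features(glyph)
best = min(FONT_DB, key=lambda k: np.linalg.norm(FONT_DB[k] - f))
print(best)   # nearest stored (font, size) entry
```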
This paper discusses a new metric for verifying translation quality between sentence pairs in Arabic-English parallel corpora. The metric combines two techniques, one based on sentence length and the other based on compression code length. Experiments on sample parallel Arabic-English test corpora indicate that combining these two techniques improves the accuracy of identifying satisfactory and unsatisfactory sentence pairs compared to sentence length or compression code length alone. The proposed method is effective at filtering noise and reducing mis-translations, resulting in greatly improved quality.
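A minimal sketch of the combined filter follows, using zlib's compressed length as the compression code length; the acceptance band limits are illustrative, not the paper's calibrated values.

```python
import zlib

# Keep a sentence pair only if both its word-length ratio and its
# compression-code-length ratio fall in an expected band. The band limits
# below are illustrative assumptions, not the paper's values.
def code_len(s):
    return len(zlib.compress(s.encode("utf-8")))

def pair_ok(ar, en, lo=0.5, hi=2.0):
    length_ratio = len(en.split()) / max(len(ar.split()), 1)
    comp_ratio = code_len(en) / max(code_len(ar), 1)
    return lo <= length_ratio <= hi and lo <= comp_ratio <= hi

print(pair_ok("ذهب الولد إلى المدرسة", "the boy went to school"))  # True
print(pair_ok("ذهب الولد إلى المدرسة", "hello"))                   # False
```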
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor (Waqas Tariq)
In recent decades, speech-interactive systems have gained increasing importance. The performance of an ASR system depends mainly on the availability of a large speech corpus. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach, which requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover varied speech orders so that the recognizer can build models for all the sounds present. For Telugu, however, because of the language's complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and similar languages) is handled within morphology in Telugu: phrases comprising several words (tokens) in English map onto a single word in Telugu. Telugu is phonetic in nature as well as morphologically rich, which is why speech technology developed for English cannot be applied directly to Telugu. This paper highlights our work toward a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition-enhancement process we developed for highly inflecting, morphologically rich languages. This method increases speech recognition accuracy while greatly reducing corpus size, and it adds Telugu words to the database dynamically, so the corpus grows over time.
This document describes a study on improving Tamil-English cross-language information retrieval through transliteration generation and mining techniques. The study achieved a peak Mean Average Precision of 0.5133 for monolingual English retrieval and 0.4145 for Tamil-English cross-language retrieval, representing an improvement over baselines without handling out-of-vocabulary terms. Transliteration mining performed better than generation at resolving out-of-vocabulary terms and boosting retrieval performance.
Implementation of Marathi Language Speech Databases for Large Dictionary (iosrjce)
Through-Mail Feature: An Enhancement to Contemporary Email Services (IRJESJOURNAL)
ABSTRACT: In many organisations with several levels of hierarchy, there is often a need to route a request from one level to the next until it reaches the last level, where a decision is ultimately taken. In contemporary email services, this is usually achieved by composing a mail and either forwarding it from one intermediary to another or carbon-copying all intermediaries. Unfortunately, these options have several drawbacks, one of which is that the content of the original request can be modified by any member along the route. In this paper, we add a through-mail feature by which a user may channel a request via a predetermined route of intermediaries, entered via a purpose-built interface on the email client. The request resides in an intermediary's transitbox for a user-specified transit time. Our transit server monitors the transit time of a transit mail at each intermediary and relays a mail from one intermediary to the next when the former responds within the transit time or when the transit time expires. In the transitbox, the user may read the comments of past intermediaries as well as post their own. This process of email transiting is significantly more convenient, tamper-proof, and traceable, making it a very desirable feature in an email service for many organisations.
Efficiency lossless data techniques for Arabic text compression (ijcsit)
This document summarizes a study that evaluated the efficiency of the LZW and BWT data compression techniques on Arabic text files of different sizes and categories (vowelized, partially vowelized, unvowelized). It found that the enhanced LZW technique, which took advantage of Arabic letters having a single case, achieved the highest compression ratios. The enhanced LZW performed better than standard LZW and BWT for all categories of Arabic text. While LZW generally compressed Arabic text better than English text, BWT only performed better on vowelized Arabic versus English. The study concluded the enhanced LZW was the most effective technique for compressing Arabic texts.
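For reference, here is textbook LZW compression producing integer codes. This is the standard algorithm, not the paper's enhanced variant; starting the dictionary from the text's own alphabet loosely mirrors the enhancement's idea of exploiting Arabic's single-case letters to keep the dictionary small.

```python
# Textbook LZW compression emitting a list of integer codes. Initializing
# the dictionary from the text's alphabet works for any Unicode script and
# keeps the starting dictionary small for single-case alphabets like Arabic.
def lzw_compress(text):
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    next_code = len(dictionary)
    w, codes = "", []
    for ch in text:
        wc = w + ch
        if wc in dictionary:
            w = wc                     # extend the current match
        else:
            codes.append(dictionary[w])
            dictionary[wc] = next_code  # learn the new phrase
            next_code += 1
            w = ch
    if w:
        codes.append(dictionary[w])
    return codes

print(lzw_compress("كتاب كتاب كتاب"))  # repeats collapse into new codes
```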
This paper presents a method for applying a speaker-independent, bidirectional speech-to-speech translation system to spontaneous dialogs in a real-time calling system. The technique recognizes spoken input, analyzes and translates it, and finally utters the translation. Speech translation falls largely under natural language processing, a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages humans use naturally, so that people can interface with computers in both written and spoken contexts using natural human languages instead of computer languages. Speech translation involves techniques to translate spoken sentences from one language to another; its major component is speech recognition, the conversion of spoken speech to text together with identification of the context and linguistic structure of the input speech. Currently, the machine does not identify whether a given word is in the past or present tense; our algorithm checks whether a word is past or present by searching for substrings such as "ed", "had", "done", etc. This paper shows how to work with APIs to translate input speech into the required output speech, thereby increasing the efficiency of speech translation on cellular devices, and describes a mobile application that monitors the audio files present on a mobile device and translates them into the required language.
This document proposes an Android application that uses Huffman encoding to compress SMS messages. It summarizes that Huffman coding assigns shorter code words to more frequently used symbols, allowing SMS text to be compressed. The application requires installation on both the sender and receiver's phones to decompress messages. Testing showed the technique achieved up to 89% compression, reducing the size of example SMS texts. The summary provides an overview of the key points about using Huffman coding for SMS compression and the proposed mobile application.
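A minimal Huffman encoder is sketched below to show how frequent symbols receive shorter codes; framing, decoding, and the Android packaging described above are omitted.

```python
# Minimal Huffman encoder using heapq: frequent symbols get shorter codes.
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries: (frequency, tiebreak, tree), where a tree is either a
    # leaf symbol (str) or a (left, right) pair of subtrees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (a, b)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):              # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                    # leaf: original character
            codes[node] = prefix or "0"          # single-symbol edge case
    walk(heap[0][2], "")
    return codes

msg = "see you at seven, send me a message"
codes = huffman_codes(msg)
bits = "".join(codes[c] for c in msg)
print(len(bits), "bits vs", 8 * len(msg), "bits uncompressed")
```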
An Improved Approach for Word Ambiguity Removal (Waqas Tariq)
Word ambiguity removal is the task of removing ambiguity from a word, i.e., identifying the correct sense of a word in ambiguous sentences. This paper describes a model that uses a part-of-speech tagger and three categories for word sense disambiguation (WSD). Such disambiguation is needed to improve interactions between users and computers; for this, supervised and unsupervised methods are combined. The WSD algorithm finds the efficient and accurate sense of a word based on domain information, and the accuracy of this work is evaluated with the aim of finding the best-suited domain of a word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word Sense Disambiguation
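As a generic illustration of sense selection from context, here is a standard simplified-Lesk baseline via NLTK's WordNet interface; it is a stand-in, not the paper's POS-tagger-plus-domain model, and requires the WordNet data (nltk.download('wordnet')).

```python
# Simplified-Lesk WSD baseline using NLTK and WordNet: picks the sense
# whose gloss overlaps most with the sentence context. Not the paper's
# method; shown only to illustrate the WSD task.
from nltk.wsd import lesk

sentence = "I went to the bank to deposit my money".split()
sense = lesk(sentence, "bank", pos="n")
print(sense, "->", sense.definition() if sense else "no sense found")
```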
A spell checker is an application program for processing natural languages in machine-readable form. Spelling checking and correction is a basic necessity and a tedious task in any language, so spell checker software is required as a fundamental tool for any writing work. A spell checker is a set of programs that analyzes a wrongly used word and corrects it with the most probable correct word. The challenging task here is doing this work for the Kannada language: in software systems, Kannada words are typed in several formats, since Kannada has many fonts for writing the grammar properly. In this paper, we describe some techniques used by a spell checker for the Kannada language. We use NLP, a field of computer science concerned with the relationship between human (natural) languages and computers, and draw on modern machine-learning-based NLP algorithms to carry out the work.
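The core mechanism behind most spell checkers, dictionary lookup by minimum edit distance, can be sketched as below; the tiny Kannada lexicon is a placeholder, and a real system would also normalize the many Kannada font encodings first.

```python
# Dictionary-based spelling correction by minimum edit distance; the
# lexicon here is a tiny placeholder word list.
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance (rolling row).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

LEXICON = ["ಮನೆ", "ಮರ", "ಹೂವು", "ಪುಸ್ತಕ"]  # placeholder lexicon

def correct(word):
    return min(LEXICON, key=lambda w: edit_distance(word, w))

print(correct("ಮನ"))   # nearest lexicon entry
```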
Brill's Rule-based Part of Speech Tagger for Kadazan (idescitation)
This paper presents a part-of-speech (POS) tagger for the Kadazan language implemented using Brill's approach, also known as Transformation-Based Error-Driven Learning. Kadazan was chosen because no POS tagger has yet been developed for this language. This study was therefore carried out to develop a POS tagger specifically for Kadazan that can tag a Kadazan corpus systematically, help reduce the ambiguity problem, and at the same time serve as a language-learning tool. The main objective of this study is to automate the tagging process for Kadazan. Brill's approach is an enhanced version of the original rule-based approach: it transforms tags based on a set of predefined rules, using the rules to change wrong tags into correct tags in the corpus. To achieve this goal, several objectives were set: to create specific lexical and contextual rules for Kadazan by applying Brill's rule-based approach, and to evaluate the effectiveness of the resulting Kadazan POS tagger. The tagging process is divided into four main phases. In the first phase, Brill's process begins by inputting new untagged text into the system. In the second phase, the input text goes through the initial-state annotator, which tags every word in the corpus with its most likely tag and produces a temporary corpus. In the third phase, the temporary corpus is compared with the goal corpus to detect any errors. In the last phase, the rules are applied to reduce the errors and fix the temporary corpus. The tagger was trained on two Kadazan children's story books containing 2,069 words, and evaluation was done by comparing the results of Brill's approach with manual tagging. The Kadazan part-of-speech tagger achieved around 93% accuracy. This study has shown how Brill's tagging approach can be used to identify tags for the Kadazan language.
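The four phases reduce to "tag with the most likely tag, then patch errors with contextual transformation rules"; the sketch below shows that shape with an invented toy lexicon, sentence, and rule, not the paper's actual Kadazan resources.

```python
# Brill-style tagging sketch: initial most-likely tagging followed by
# contextual transformation rules. Lexicon, sentence, and rule are invented
# stand-ins, not the paper's Kadazan data.
LEXICON = {"i": "PRON", "waig": "NOUN"}   # toy most-likely-tag lexicon

def initial_tag(tokens):
    # Phases 1-2: annotate every word with its most likely tag.
    return [(t, LEXICON.get(t, "NOUN")) for t in tokens]  # NOUN as default

def apply_rules(tagged):
    # Phase 4: rules of the form "change tag A to B when previous tag is C".
    rules = [("NOUN", "VERB", "PRON")]    # invented contextual rule
    out = list(tagged)
    for k in range(1, len(out)):
        word, tag = out[k]
        for old, new, prev in rules:
            if tag == old and out[k - 1][1] == prev:
                out[k] = (word, new)
    return out

tokens = "i monongkir waig".split()       # invented example sentence
print(apply_rules(initial_tag(tokens)))   # PRON VERB NOUN
```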
Robust Text Watermarking Technique for Authorship Protection of Hindi Languag... (CSCJournals)
Digital text documents have become a significantly important part of the Internet, and a large number of users are attracted to this digital form of text. But some security threats arise concurrently. Digital libraries offer effective ways to access educational materials, government e-documents, financial documents, social media content, and much else; however, content authorship and tamper detection for all these digital text documents require special attention. To date, considerably few digital watermarking techniques exist for text documents. In this paper, we propose a method for effective watermarking of Hindi-language text documents. Hindi is among the most widely spoken languages in the world, and digital Hindi content of various types is widely available. In the proposed technique, the watermark is logically embedded in the text using the 'swar' (vowel), a distinctive feature of the Hindi language, supported by suitable encryption. In the extraction phase, a Certificate Authority (CA) plays an important role in the authorship protection process as a trusted third party: the text is decrypted and the watermark extracted to prove genuine authorship. Our technique has been tested against various feasible text attacks at different embedding frequencies.
Analysis review on feature-based and word-rule based techniques in text stega... (journalBEEI)
This paper presents several techniques used in text steganography in terms of feature-based and word-rule-based methods, and analyses the performance and the evaluation metrics of these techniques. The paper aims to identify the main techniques of text steganography, namely feature-based and word-rule-based, and to survey the various techniques used with them. The primary technique used in text steganography was found to be the feature-based technique, owing to its simplicity and security. The common evaluation metrics in text steganography were security, capacity, robustness, and embedding time. Future efforts are suggested to focus on the methods used in text steganography.
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach (cscpconf)
Punjabi text classification is the process of assigning predefined classes to unlabelled text documents. Because of the dramatic increase in the amount of content available in digital form, text classification has become an urgent need for managing digital data efficiently and accurately. Until now, no text classifier has been available for Punjabi documents. Therefore, in this paper, existing classification algorithms such as Naïve Bayes and centroid-based techniques are applied to Punjabi text classification, and one new approach is proposed that combines Naïve Bayes (to extract the relevant features and so reduce dimensionality) with ontology-based classification (which acts as the text classifier using the extracted features). These algorithms are evaluated on 184 Punjabi news articles on sports, classifying the documents into 7 classes: ਕ੍ਰਿਕਟ (krikaṭ), ਹਾਕੀ (hākī), ਕਬੱਡੀ (kabḍḍī), ਫੁਟਬਾਲ (phuṭbāl), ਟੈਨਿਸ (ṭainis), ਬੈਡਮਿੰਟਨ (baiḍmiṇṭan), ਓਲੰਪਿਕ (ōlmpik).
Scalable Discovery Of Hidden Emails From Large Folders (feiwin)
The document describes a framework for reconstructing hidden emails from email folders by identifying quoted fragments and using a precedence graph to represent relationships between emails. It introduces optimizations like email filtering using word indexing and LCS anchoring using indexing to handle large folders and long emails efficiently. An evaluation on the Enron dataset showed the framework could reconstruct hidden emails for many users, and optimizations improved effectiveness.
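The reconstruction hinges on matching quoted fragments against candidate emails, for which a longest-common-substring check is the core primitive; below is the plain dynamic-programming version, whereas the paper's contribution is the indexing ("LCS anchoring") that avoids running it on every pair.

```python
# Longest common substring by dynamic programming: the basic primitive for
# matching a quoted fragment against a candidate email body.
def longest_common_substring(a, b):
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the match diagonally
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return a[best_end - best:best_end]

quoted = "> could you send the Q3 numbers before Friday"
candidate = "could you send the Q3 numbers before Friday? thanks"
print(longest_common_substring(quoted.lstrip("> "), candidate))
```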
Review on feature-based method performance in text steganography (journalBEEI)
Steganography in the text domain, which can hide an essential message from an intruder, is a crucial issue: much personal information is carried in the medium of text, and steganography is expected to protect that information by hiding a message so that it is unrecognizable to human or machine vision. This paper concerns one category of steganography on the text medium, text steganography, with a specific focus on the feature-based method. It reviews research efforts of the last decade to assess the performance of techniques in the development of feature-based text steganography, and then examines related factors that influence these techniques and several open issues in the development of the feature-based method.
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA... (kevig)
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, since it improves the overall performance of those applications. In this paper, deep learning techniques are used to perform NER on Hindi text, because Hindi NER is far less developed than English NER; this is a barrier for resource-scarce languages, for which many resources are not readily available. Researchers have applied various techniques to this problem, such as rule-based, machine-learning-based, and hybrid approaches, and deep learning algorithms are now being developed at large scale as an innovative route to advanced NER models. We devise a novel architecture that adds residual connections to a Bidirectional Long Short-Term Memory (BiLSTM) network with fastText word-embedding layers, using pre-trained word embeddings to represent the words of the corpus, whose NER tags are defined by the annotated corpora used. Developing an NER system for Indian languages is a comparatively difficult task. We run experiments comparing NER results with normal and fastText embedding layers, and analyse the performance of the word embeddings under different training batch sizes. We report state-of-the-art F1 scores with this approach.
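A skeleton of the BiLSTM token tagger that forms the backbone of such a model is sketched below in PyTorch; the residual connections and fastText vectors that are the paper's additions are omitted, and the dimensions, vocabulary, and tagset are illustrative.

```python
# BiLSTM token-tagger skeleton in PyTorch. The embedding is randomly
# initialized here; the paper would load fastText vectors and add residual
# connections around the BiLSTM.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, n_tags=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # swap in fastText here
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)      # per-token tag scores

    def forward(self, token_ids):
        x = self.emb(token_ids)
        h, _ = self.lstm(x)
        return self.out(h)        # shape: (batch, seq_len, n_tags)

model = BiLSTMTagger(vocab_size=1000)
dummy = torch.randint(0, 1000, (2, 7))   # batch of 2 sentences, length 7
print(model(dummy).shape)                # torch.Size([2, 7, 5])
```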
The effect of training set size in authorship attribution: application on sho... (IJECEIAES)
Authorship attribution (AA) is a subfield of linguistic analysis that aims to identify the original author of a text from a set of candidate authors. Several research papers have been published and several methods and models developed for many languages; however, the number of related works for Arabic is limited. Moreover, the impact of short text length and training-set size is not well addressed: to the best of our knowledge, no published work in this direction, for Arabic or other languages, is available. We therefore propose to investigate this effect, taking into account different stylometric combinations. The Mahalanobis distance (MD), linear regression (LR), and multilayer perceptron (MP) are selected as AA classifiers. During the experiment, the training dataset size is increased and the accuracy of the classifiers is recorded. The results are quite interesting and show different classifier behaviours. Combining word-based stylometric features with n-grams provides the best accuracy, reaching 93% on average.
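Of the three classifiers, the Mahalanobis-distance one is simple enough to sketch directly; the stylometric feature vectors here are random placeholders, and a pooled covariance across authors is an assumption.

```python
# Minimal Mahalanobis-distance author classifier over stylometric feature
# vectors (e.g. function-word rates, mean word length). Data are random
# placeholders and a pooled covariance is assumed.
import numpy as np

rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 1.0, (50, 4))   # feature vectors for author A
X_b = rng.normal(1.5, 1.0, (50, 4))   # feature vectors for author B

cov_inv = np.linalg.inv(np.cov(np.vstack([X_a, X_b]).T))
means = {"A": X_a.mean(axis=0), "B": X_b.mean(axis=0)}

def mahalanobis(x, mu):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

query = rng.normal(1.5, 1.0, 4)   # unseen text drawn near author B's mean
print(min(means, key=lambda a: mahalanobis(query, means[a])))  # -> B
```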
1) The paper proposes an efficient Tamil text compaction system that reduces Tamil text to around 40% of the original by identifying word categories and mapping words to compact forms while maintaining meaning.
2) The system handles common Tamil words, abbreviations/acronyms, and numbers by using a morphological analyzer to identify word roots and a generator to re-add suffixes. Compact forms are retrieved from mappings stored in data structures like trees and hashmaps.
3) Testing on over 10,000 words showed the final text was reduced to 40% of the original size, providing a more efficient way to communicate in Tamil on platforms with character limits like social media and text messages.
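The compact-form lookup at the heart of step 2 can be sketched as a hashmap from root word to short form, with suffixes re-attached afterwards; the Tamil entries and the suffix-splitting rule below are placeholders, since the real system uses a full morphological analyzer and generator.

```python
# Sketch of root -> compact-form lookup with naive suffix handling. The
# compact forms and the one-character "suffix" rule are placeholders for
# the paper's morphological analyzer/generator and stored mappings.
COMPACT = {"வணக்கம்": "வண.", "நன்றி": "நன்."}   # hypothetical mappings

def split_suffix(word):
    # Placeholder morphology: treat the last character as a suffix when
    # the remaining stem is a known root.
    if word[:-1] in COMPACT:
        return word[:-1], word[-1]
    return word, ""

def compact(word):
    if word in COMPACT:
        return COMPACT[word]
    root, suffix = split_suffix(word)
    return COMPACT.get(root, root) + suffix   # re-attach suffix

text = "வணக்கம் நன்றி"
print(" ".join(compact(w) for w in text.split()))
```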
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS (ijdms)
In this paper, we describe a Convolutional Neural Network (CNN) model developed for the classification of advertisements. The method has been tested on both Arabic and Slovak texts; the advertisements are short texts drawn from classified-advertisement websites. We evolved a modified CNN model, implemented it, and developed further modifications, studying their influence on the performance of the proposed network. The result is a functional model of the network, its implementation in Java and Python, and an analysis of the model's results for different network and input-data parameters. The experimental results show that the developed CNN model is useful in the domains of Arabic and Slovak short texts, mainly for the classification of advertisements.
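A skeleton of a 1-D convolutional short-text classifier of this general kind (embedding, convolution, global max pooling, linear layer) is sketched below in PyTorch; the dimensions, vocabulary, and two-class setup are illustrative, not the paper's configuration.

```python
# 1-D CNN short-text classifier skeleton: embedding -> Conv1d -> ReLU ->
# global max pool -> linear. All sizes are illustrative.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=64, n_filters=100,
                 kernel=3, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, ids):                   # ids: (batch, seq_len)
        x = self.emb(ids).transpose(1, 2)     # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x))          # (batch, n_filters, L)
        h = h.max(dim=2).values               # global max pool over time
        return self.fc(h)

model = TextCNN()
ads = torch.randint(0, 5000, (4, 20))         # 4 dummy ads, 20 tokens each
print(model(ads).shape)                       # torch.Size([4, 2])
```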
The Great Mind Challenge'12
VOICE BASED WEB BROWSER
This document covers the functional and non-functional requirements of the
Voice-Based Web Browser including the physical description of the system as well as the
behavioral and other factors necessary to provide a complete and comprehensive
description of the Voice-Based Web Browser.
The user sends a query containing an abbreviation which he/she has written from his/her own mental mapping. Therefore, there may be errors in the user-provided abbreviation. To understand what types of errors we are dealing with, we conducted a survey of people who were not well versed in English. We collected several queries [4] from each of them and found that users primarily commit the following types of errors:
a) Extra vowel: e.g., SMS may be written as ASAMAS or ESEMES; SP may be written as ASP, ESPI, etc.
b) Vowel deficiency: e.g., IAS may be written as AS; AIDS may be written as ADS.
c) Extra consonant: e.g., CNG may be written as CNGG.
d) Wrong consonant: e.g., B.Tech may be written as V.Tech; KVPY may be written as KBPY.
e) Other errors: vowel replacement (e.g., CEO written as CIO), typing errors (e.g., IIPQ written as IIPO), and same-phonetic-sound errors (e.g., UPSC written as UPSE, NEWS written as NEUS).
Figure 2 shows the percentage of each of these error types.
We focused mainly on the first two kinds of error, as they were committed most frequently. We applied heuristic techniques based on N-grams (an inverted index) and edit distance to match queries against the database entries of abbreviations, which we collected in advance. We created this list manually from various online sources; it contains all kinds of abbreviations, ranging from governmental organizations to the education field. We also created a list of common Hindi terms used in framing 'Wh'-type questions (e.g., 'kya' (क्या), 'kee' (कि), 'kyon' (क्यों), 'kaahaan' (कहाँ), etc.).
To find matches for queries containing abbreviations (possibly with spelling errors), we stored roughly 1500 words spread across a large number of files, using an inverted-index technique to store and retrieve the correct abbreviations. For each abbreviation in the database, we extract all character bigrams. For example, the abbreviation 'SAARC' yields four bigram entries: SA, AA, AR, and RC. For each such bigram, we create a file and store the word 'SAARC' in it; in precise terms, 'SAARC' appears in four files, namely SA.txt, AA.txt, AR.txt, and RC.txt. The same exercise is carried out for each abbreviation. There are at most 26x26 possible bigrams, so 26x26 files are created. Some files hold data for many abbreviations while others are empty; typically, a single file contains zero to only a few entries [5]. We did the same for character trigrams, leading to 26x26x26 files.
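As a concrete illustration, the following minimal sketch (class, method, and directory names are our own, not part of the system) builds such an index by appending each abbreviation to one file per distinct character n-gram:

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Builds the n-gram inverted index described above: each abbreviation is
// appended to one file per distinct character n-gram it contains, so
// "SAARC" is written into SA.txt, AA.txt, AR.txt and RC.txt.
public class NgramIndexBuilder {

    public static void index(String abbreviation, Path indexDir, int n) throws IOException {
        String word = abbreviation.toUpperCase();
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));          // collect the distinct n-grams
        }
        for (String gram : grams) {
            Path file = indexDir.resolve(gram + ".txt");  // one file per n-gram, as in the text
            Files.write(file, (word + System.lineSeparator()).getBytes("UTF-8"),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createDirectories(Paths.get("bigram-index"));
        index("SAARC", dir, 2);   // creates/updates SA.txt, AA.txt, AR.txt, RC.txt
        index("IRCTC", dir, 2);
    }
}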
For a Hindi word like 'KYA' (क्या) (meaning 'what' in English), the bigrams KY and YA are generated, and there is only one trigram, namely 'KYA'. When the application receives a query, it scans all of its words. Since user input is in mixed language (we considered Hindi and English written in Roman script) and potentially contains many typos and misspellings, we first eliminate vowels from each word and then search the files generated by the N-gram technique. Each input word is mapped to the word in the list (both the Hindi and the English lists) that has the highest frequency across the bigram and trigram files. In case of several matches, the word requiring the least modification is chosen, with ties broken arbitrarily. Let us illustrate the algorithm with the example query "aaiaarctc kay hai" (आईआरसीटीसी क्या है), meaning "what is 'aaiaarctc'?".
Table 1. Input Processing Data

Word (o)        Vowels removed (w)    #chars remaining (p)
AAIAARCTC       RCTC                  4
KYA             KY                    2
HAI             H                     1
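The vowel-removed forms in Table 1 follow from a single transformation; this small fragment (class name ours) reproduces the table values:

public class VowelRemovalDemo {
    public static void main(String[] args) {
        for (String o : new String[] {"AAIAARCTC", "KYA", "HAI"}) {
            String w = o.replaceAll("[AEIOU]", "");                 // drop vowels
            System.out.println(o + " -> " + w + " (p = " + w.length() + ")");
        }
    }
}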
The query is converted to uppercase, and then the following steps are performed for each word:
a) Vowels are removed, because spelling errors occur mostly in the use of vowels.
b) The number of remaining letters (p) is checked in order to take the following decisions:
i) If p >= 3, check only the trigram files for the word w or its trigram subsequences. Here, the files named RCT and CTC are searched to check whether they contain some subsequence of RCTC. If there is a matching entry in all such files, that abbreviation is chosen. If more than one entry contains this subsequence, we choose the word with the highest number of occurrences among these n-gram files; the least edit distance from the query word (o) is used to break ties, and any word from the final set is then chosen arbitrarily.
ii) If p == 2, check only the bigram files for the word w or its bigram subsequences, as above. Here, the file named KY is selected and searched to check whether it contains some subsequence of KY. If there is a matching entry, that abbreviation is chosen.
iii) If p < 2, check only the bigram files for the original word before vowel removal (o) or its bigram subsequences. Hence, the files HA and AI are searched to check whether they contain some subsequence of HAI, to get the required matching word.
c) We check the chosen word, which may come from the Hindi files, the English files, or both:
i) The word is from a Hindi file only: ignore it, as it is not an abbreviation.
ii) The word is from an English file only: process it further as an assumed acronym.
iii) Two different words are returned, one from the Hindi files and one from the English files: choose the one having the minimum edit distance from the original input and then treat it as case i) or ii).
d) The chosen abbreviation is looked up in our collection of abbreviations, and its expanded form is extracted. The steps are summarized in Figure 3; a condensed sketch of the matching procedure is given below.
We assume that a single query contains at most one acronym, accompanied by zero or more Hindi words.
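The following condensed sketch covers steps a) and b). It is a simplification of our file-based procedure, not the exact code: it assumes the n-gram files have already been loaded into in-memory maps (n-gram to the set of abbreviations stored in the corresponding file) and matches contiguous n-grams of the word; candidates are scored by the number of n-gram hits, with edit distance breaking ties:

import java.util.*;

// Condensed sketch of matching steps a) and b): strip vowels, pick the
// bigram or trigram index by the remaining length p, count how many n-gram
// files each candidate appears in, and break ties by edit distance.
public class AbbreviationMatcher {

    private final Map<String, Set<String>> bigramIndex;   // e.g. "KY" -> {"KYA", ...}
    private final Map<String, Set<String>> trigramIndex;  // e.g. "RCT" -> {"IRCTC", ...}

    public AbbreviationMatcher(Map<String, Set<String>> bi, Map<String, Set<String>> tri) {
        this.bigramIndex = bi;
        this.trigramIndex = tri;
    }

    public String match(String input) {
        String o = input.toUpperCase();                  // original word
        String w = o.replaceAll("[AEIOU]", "");          // step a): vowels removed
        int p = w.length();
        Map<String, Set<String>> index = (p >= 3) ? trigramIndex : bigramIndex;
        String key = (p >= 2) ? w : o;                   // p < 2: fall back to the original word
        int n = (p >= 3) ? 3 : 2;

        Map<String, Integer> votes = new HashMap<>();    // candidate -> number of n-gram hits
        for (int i = 0; i + n <= key.length(); i++) {
            for (String cand : index.getOrDefault(key.substring(i, i + n), Collections.emptySet())) {
                votes.merge(cand, 1, Integer::sum);
            }
        }
        // highest vote count wins; least edit distance from o breaks ties
        return votes.entrySet().stream()
                .max(Comparator.comparingInt(Map.Entry<String, Integer>::getValue)
                        .thenComparingInt(e -> -editDistance(o, e.getKey())))
                .map(Map.Entry::getKey).orElse(null);
    }

    // standard dynamic-programming Levenshtein distance
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }
}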
2.2 System Efficiency
We tested our system on the user data we collected; the following are the results obtained under two different options:
a) Removing vowels: we removed the vowels from the user query before searching.
b) Keeping vowels: we searched directly, without removing vowels.
Figure 4 shows the results.
2.3 Data Extraction From Wikipedia
Given a chosen acronym, we first look it up in our associative array of collected acronyms with their expanded forms. The expansion is used to generate the URL for the Web-Harvest API (open source) [3], which returns the content of the corresponding Wikipedia page in XML format. Since we need to provide only a brief definition or introduction that can fit within an SMS, we are interested only in the first few lines (one or two sentences), up to 200 characters, of the Wiki page. The content so obtained is stored in a temporary file.
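For illustration, the following stand-alone sketch fetches the introductory text through the MediaWiki "extracts" API and keeps the first 200 characters. Our system instead uses the Web-Harvest API and its XML output, so this is only an approximation of the actual step, and the crude string handling below would be replaced by a proper JSON/XML parser in practice:

import java.io.*;
import java.net.*;

// Fetches the introduction of a Wikipedia article and truncates it to
// roughly one SMS-sized definition (200 characters).
public class WikiIntroFetcher {

    public static String fetchIntro(String title) throws IOException {
        String url = "https://en.wikipedia.org/w/api.php?action=query&prop=extracts"
                   + "&exintro&explaintext&format=json&titles="
                   + URLEncoder.encode(title, "UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent", "abbreviation-sms-prototype");
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line);
        }
        String body = sb.toString();
        int k = body.indexOf("\"extract\":\"");   // crude field grab; use a JSON parser in practice
        String text = (k >= 0) ? body.substring(k + 11) : body;
        return text.substring(0, Math.min(200, text.length()));
    }
}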
2.4 Translation of Data
The stored data is sent to Google Translate, and we then use the Yahoo Query Language (YQL) [6] to retrieve the translation. We use the Web-Harvest API once again to extract the translated text, and the extracted content is stored in another temporary file.
2.5 Transliteration of Data
For transliteration, the ICU4J [7] library is used. ICU4J is a set of Java libraries that provides comprehensive support for Unicode, software globalization, and internationalization. The translated data is passed to a method using this library, and the transliterated text it generates is presented to the user through a Java applet. The output is shown in Figure 5.
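The core ICU4J call is small; this minimal example (class name ours) converts Devanagari text to Roman script:

import com.ibm.icu.text.Transliterator;

// Devanagari-to-Roman transliteration with ICU4J, as used in Section 2.5.
public class RomanizerDemo {
    public static void main(String[] args) {
        Transliterator toLatin = Transliterator.getInstance("Devanagari-Latin");
        String hindi = "आईआरसीटीसी क्या है";                 // sample translated output
        System.out.println(toLatin.transliterate(hindi));  // Roman-script text for the SMS
    }
}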
3. CONCLUSION
We developed a prototype application that mobile service providers can use to cater to the information needs of their customers through SMS. Specifically, we attempt to address the needs of semi-literate users of low-cost mobile phones that have neither internet connectivity nor local-language support. We made use of translation and transliteration through Web applications and libraries, along with Web-scraping techniques, to process the query and provide the answer in the user's native language using Roman script. We believe this software can be immensely useful for information dissemination and access where Internet penetration is low and people's knowledge of English is limited.
4. ACKNOWLEDGEMENTS
We wish to thank Divesh Sanjay Kothari and Abhinay Saraswat,
Department of CSE, ISM Dhanbad for their all-round help.
5. ADDITIONAL AUTHORS
Ashok Kumar (ISM Dhanbad, ashokdavas@gmail.com), L. Gautam (ISM Dhanbad, gtam25@gmail.com), Abhishek Ranjan (ISM Dhanbad, aksharudarya@gmail.com)
Figure 2. Types of Error
Figure 3. Input Processing Steps
Figure 4. System Efficiency
Figure 5. Input Output Panel
6. REFERENCES
[1] S. Lothia, W. James, and B. Hwang. System and methods for providing subscriber-initiated information over the short message service (SMS) or a micro browser. US Patent 6,560,456, May 6, 2003.
[2] J. Salonen. SMS inquiry and invitation distribution method and system. US Patent RE44,073, Mar. 12, 2013.
[3] V. Nikic and A. Wajda. Web-Harvest, version 2.0, February 2010. As on June 25, 2014.
[4] User Query Data: https://www.dropbox.com/s/kbwabem29f0mwu9/data.txt?dl=0
[5] S. K. D. S. Kothari, A. Saraswat, and S. Pal. FAQ Retrieval using Noisy Queries. In FIRE 2013 Workshop Pre-Proceedings, December 2013.
[6] YQL Console: https://developer.yahoo.com/yql/
[7] ICU User Guide, as on June 25, 2014.
[8] Video Demo Link: https://www.dropbox.com/s/l270iq3gnhgafvy/PARTM.mp4?dl=0