The document describes a machine learning approach for language identification, named entity recognition, and transliteration on query words. It discusses:
1) Using supervised machine learning classifiers like random forest, decision trees, and SVMs along with contextual, character n-gram, and gazetteer features for language identification of Hindi-English and Bangla-English words.
2) Applying an IOB tagging scheme and features like character n-grams, context words, and typographic properties for named entity recognition and classification.
3) A statistical machine transliteration model that segments, aligns, and maps source and target language transliteration units based on context and probabilities learned from parallel training data.
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime or game characters. To overcome this challenge, we propose segmenting the lines of Japanese anime or game characters using subword units, which were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age, and each anime character, and show that they are linguistic speech patterns specific to each feature. Additionally, a classification experiment shows that the model with subword units outperformed that with the conventional method.
A spell checker is an application program for processing natural languages in machine-readable format effectively. Spelling checking and correction is a basic necessity, and tedious work, in any language, so spell-checker software is required to do it. A spell checker is a set of programs that analyzes a wrongly used word and corrects it with the most likely correct word. The challenge addressed here is doing this for the Kannada language. In software systems, Kannada words are typed in several formats, since Kannada has many fonts for writing the language properly. In this paper, we describe some techniques used by a spell checker for the Kannada language. We use NLP, a field of computer science concerned with the relationship between humans (i.e., natural languages) and computers. Modern NLP algorithms based on machine learning are typically used to carry out this work.
NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Transliteration may be defined as the process of mapping the sounds of a text written in one language into another language. This paper discusses transliteration and its use in Named Entity Recognition. We have designed code that performs transliteration and assists in the process of Named Entity Recognition, and we present some results of Named Entity Recognition (NER) using transliteration.
A New Approach to Parts of Speech Tagging in Malayalam
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's usage in the sentence. Usually, these tags indicate a syntactic classification like noun or verb, and sometimes include additional information, such as case markers (number, gender etc.) and tense markers. A large number of current language processing systems use a parts-of-speech tagger for pre-processing.
There are two main approaches to parts-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules. It is the oldest approach and uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It uses a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower complexity in both time and space. The stochastic approach is the more widely used one nowadays because of its accuracy.
Malayalam is a Dravidian language, inflectional, with suffixes attached to the root word forms. The currently used algorithms are efficient machine learning algorithms, but they were not built for Malayalam, which affects the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent tag information. The algorithm uses multithreading, and tagging is done using the probability of occurrence of the sentence structure along with the dictionary entry.
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering etc. The aim of NER is to classify words into predefined categories like location name, person name, organization name, date, time etc. In this paper we describe a Hidden Markov Model (HMM) based machine learning approach in detail to identify named entities. The main idea behind using an HMM for building an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed; they are dynamic in nature, and one can define them according to one's interest. The corpus used by our NER system is also not domain specific.
Language Identifier for Languages of Pakistan Including Arabic and Persian
A language recognizer/identifier/guesser is a basic application used to identify the language of a text document. It simply takes a file as input and, after processing its text, decides the language of the text document with precision using LIJ-I, LIJ-II and LIJ-III. LIJ-I alone yields poor accuracy; it is strengthened by LIJ-II, which is further boosted to a higher level of accuracy by LIJ-III. The system also calculates digram probabilities and average accuracy percentages. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences. A Java-based language recognizer is developed and presented in this paper in detail.
Corpus annotation for corpus linguistics (Nov 2009)
Lecture on corpus annotation for corpus linguistics. Contents: DIY corpora, e-texts, character set and text encoding issues, document structure, DTDs, documentation; tools and issues in annotation procedures, good practices; examples from anaphora resolution and named entity recognition annotation campaigns; evaluation of corpus annotation.
Abstract
Part-of-speech tagging plays an important role in developing natural language processing software. Part-of-speech tagging means assigning a part-of-speech tag to each word of a sentence. The part-of-speech tagger takes a sentence as input and assigns the appropriate part-of-speech tag to each word of that sentence. This article surveys the different works done on Odia POS tagging.
________________________________________________
A Review on a Web-Based Punjabi to English Machine Transliteration System
This paper presents the transliteration of noun phrases from Punjabi to English using a statistical machine translation approach. Transliteration maps the letters of a source script to the letters of another language. Forward transliteration converts an original word or phrase in the source language into a word in the target language; backward transliteration is the reverse process, converting the transliterated word or phrase back into its original word or phrase. Transliteration is an important area of research in NLP. Natural Language Processing (NLP) is the ability of a computer program to understand human speech as it is spoken, and is an important component of AI. Artificial Intelligence is a branch of science that deals with helping machines find solutions to complex problems in a human-like fashion. The transliteration system is developed using SMT. Statistical Machine Translation (SMT) is a data-oriented statistical framework for translating text from one natural language to another based on knowledge learned from data.
A SURVEY OF LANGUAGE-DETECTION, FONT-DETECTION AND FONT-CONVERSION SYSTEMS FOR ...
A large amount of the data in Indian languages stored digitally is in ASCII-based font formats. ASCII has a 128-character set and is therefore unable to represent all the characters necessary to deal with the variety of scripts available worldwide. Moreover, these ASCII-based fonts are not based on a single standard mapping between character codes and individual characters for a particular Indian script, unlike English-language fonts based on the standard ASCII mapping. Therefore, the fonts for a particular script must be available on the system to accurately represent the data in that script. Also, converting data from one font into another is a difficult task. The non-standard ASCII-based fonts also pose problems when performing search on texts in Indian languages available over the web. There are 25 official languages in India, and the amount of digital text available in ASCII-based fonts is much larger than the text available in the standard ISCII (Indian Script Code for Information Interchange) or Unicode formats. This paper discusses the work done in the fields of font detection (identifying the font of a given text) and font conversion (converting ASCII-format text into the corresponding Unicode text).
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
Manipuri is both a minority and a morphologically rich language with genetic features similar to the Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology, and is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a useful research field of computer science that deals with the processing of large amounts of natural language corpora. NLP applications encompass E-Dictionaries, Morphological Analyzers, Reduplicated Multi-Word Expressions (RMWE), Named Entity Recognition (NER), Part of Speech (POS) Tagging, Machine Translation (MT), WordNet, Word Sense Disambiguation (WSD) etc. In this paper, we present a study of the advancements in NLP applications for the Manipuri language, presenting at the same time a comparison table of the approaches and techniques adopted and the results obtained for each application, followed by a detailed discussion of each work.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
Source and target word segmentation and alignment is a primary step in the statistical learning of transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for dealing with out-of-vocabulary words in English-Chinese settings in the presence of multiple target dialects, we asked whether this would hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method applies equally well to most written Indic languages). Our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum entropy based transliteration learning system due to Kumaran and Kellner; our testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top-10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, since it improves the overall performance of NLP applications. In this paper, deep learning techniques are used to perform NER on Hindi text, since, compared to English NER, Hindi NER has not been sufficiently explored. This is a barrier for resource-scarce languages, as many resources are not readily available. Many researchers use various techniques such as rule-based, machine learning based and hybrid approaches to solve this problem. Deep learning based algorithms are nowadays being developed at large scale as an innovative approach for advanced NER models that give the best results. In this paper we devise a novel architecture based on residual networks for a Bidirectional Long Short Term Memory (BiLSTM) model with fastText word embedding layers. For this purpose we use pre-trained word embeddings to represent the words in the corpus, where the NER tags of the words are defined in the annotated corpora used. Development of an NER system for Indian languages is a comparatively difficult task. In this paper, we have run various experiments to compare the results of NER with normal embedding and fastText embedding layers, and to analyse the performance of the word embeddings with different batch sizes when training the deep learning models. We present state-of-the-art results with this approach, as measured by F1 score.
Arabic tweeps dialect prediction based on machine learning approach
In this paper, we present our approach for profiling Arabic authors on Twitter based on their tweets. We consider the dialect of an Arabic author as an important trait to be predicted. For this purpose, many indicators, feature vectors and machine learning based classifiers were implemented. The results of these classifiers were compared to find the best dialect prediction model, which was obtained using a random forest classifier with full forms and their stems as the feature vector.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
This paper describes the use of Naive Bayes to address the task of assigning function tags, and of applying a context-free grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step for parsing. In the function tagging task, we use the functionally annotated corpus and tag Myanmar sentences with correct segmentation, POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply the context-free grammar (CFG) to find the parse tree of function-tagged Myanmar sentences. Experiments show that our analysis achieves good results in parsing simple sentences and three types of complex sentences.
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
Extracting key phrases from documents is a common task in many applications. In general, a noun phrase extractor consists of three modules: tokenization, part-of-speech tagging, and noun phrase identification. These are used as the three main steps in building the new system, ANPE. This paper aims at extracting Arabic noun phrases from a corpus of documents; the relevant criteria (recall and precision) are used as evaluation measures. On the one hand, when using NPs rather than single terms, the system yields more relevant documents among those retrieved; on the other hand, it gives lower precision because the number of retrieved documents decreases. At the end, the researchers conclude and recommend improvements for more effective and efficient research in the future.
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP). Verb phrase identification within a sentence is very useful for a variety of NLP applications. One of the core enabling technologies required in NLP applications is morphological analysis. This paper presents a Myanmar verb phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. In machine translation, a large amount of information is needed to guide the translation process. Myanmar is an inflected language, and very few lexicons have been created or researched for it compared to other languages such as English, French and Czech. Therefore, this work proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results showed that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
Identification of scripts in a multi-script document is one of the important steps in the design of an OCR system for successful analysis and recognition. Most optical character recognition (OCR) systems can recognize at most a few scripts, but for large archives of document images containing different scripts, there must be some way to automatically categorize the documents before applying the proper OCR to them. Much work has already been reported in this area; in the Indian context, though some results have been reported, the task is still in its infancy. This paper presents research on the identification of Tamil, English and Hindi scripts at the word level, irrespective of font face and size. It also identifies English numerals in multilingual document images. The proposed technique performs a document vectorization method that generates vectors from nine zones segmented over the characters, based on their shape, density and transition features. The script is then determined using rule-based classifiers and their sub-classifiers, which contain sets of classification rules derived from the vectors. The proposed system identifies scripts in document images even when they suffer from noise and other kinds of distortion. Results from experiments and simulations, compared against human vision, show that the proposed technique identifies scripts and numerals with minimal pre-processing and high accuracy. In future, this can also be extended to other scripts.
Author Credits - Maaz Anwar Nomani
A Semantic Role Labeler (SRL) is a semantic parser that can automatically identify and then classify the arguments of a verb in a natural language sentence, here for Hindi and Urdu. For example, in the sentence "Sara won the competition because of her hard work.", 'won' is the main verb and there are 3 arguments of this verb: 'Sara' (Agent), 'hard work' (Reason) and 'competition' (Theme). The problem an SRL addresses is how to make a machine identify and then classify the arguments of a verb in a natural language sentence.
Since there are 2 sub-problems here (identification and classification), our SRL has a pipeline architecture in which a binary classifier (logistic regression) is first trained to identify whether a word is an argument of a verb in a sentence or not (yes or no), and subsequently a multi-class classifier (SVM with a linear kernel) is trained to classify the arguments identified by the binary classifier into one of 20 classes. These 20 classes are the various notions present in a natural language sentence (e.g. Agent, Theme, Location, Time, Purpose, Reason, Cause etc.). These 'notions' are called PropBank labels, or semantic labels, present in a Proposition Bank, which is a collection of hand-annotated sentences.
In essence, SRL facilitates semantic parsing, which essentially is the research investigation of identifying WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN in a natural language sentence.
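A schematic sketch of such a two-stage pipeline (our own illustration with scikit-learn; the per-word feature extraction and the annotated training data are assumed to exist elsewhere) might be:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stage 1: binary classifier -- is this word an argument of the verb?
identifier = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# Stage 2: multi-class classifier -- which of the 20 PropBank labels?
classifier = make_pipeline(DictVectorizer(), LinearSVC())

def label_arguments(word_feats):
    """word_feats: one feature dict per word of a sentence (both pipelines
    must already be fitted on PropBank-annotated data). Returns one
    PropBank-style label per word, or None for non-argument words."""
    is_arg = identifier.predict(word_feats)            # stage 1: yes/no
    labels = [None] * len(word_feats)
    arg_idx = [i for i, a in enumerate(is_arg) if a]
    if arg_idx:
        roles = classifier.predict([word_feats[i] for i in arg_idx])  # stage 2
        for i, role in zip(arg_idx, roles):
            labels[i] = role
    return labels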
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language "hard to pronounce" by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model's proficiency against its grapheme and phoneme inventories, we show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce. Namely, the complexity of a language's pronunciation relative to its orthography is due to the expressiveness or simplicity of its grapheme-to-phoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
An expert system for automatic reading of a text written in Standard Arabic
In this work we present our expert system for automatic reading, or speech synthesis, of text written in Standard Arabic. Our work is carried out in two main stages: the creation of the sound database, and the transformation of written text into speech (Text-To-Speech, TTS). This transformation is done firstly by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with the aim of transforming it into its corresponding phonetic sequence, and secondly by the generation of the voice signal corresponding to the transcribed chain. We lay out the different stages of the design of the system, as well as the results obtained, compared to other works studied, for realizing TTS based on Standard Arabic.
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar is a low-resource language, and this is one of the main reasons why Myanmar Natural Language Processing has lagged behind that of other languages. Currently, there is no publicly available named entity corpus for the Myanmar language. As part of this work, the very first manually annotated named entity tagged corpus for the Myanmar language was developed and proposed to support the evaluation of named entity extraction. At present, our named entity corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar Named Entity Recognition. Experimental results of the 10-fold cross validation revealed that syllable-based neural sequence models without additional feature engineering can give better results compared to the baseline CRF model. This work also aims to discover the effectiveness of neural network approaches to textual processing for the Myanmar language, as well as to promote future research on this understudied language.
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
Neural machine translation is a new approach to machine translation that has shown effective results for high-resource languages. Recently, attention-based neural machine translation with large-scale parallel corpora has played an important role in achieving high-performance translation results. In this research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural machine translation models are introduced at the word-to-word, character-to-word, and syllable-to-word levels. We run experiments with the proposed models to translate long sentences and to address morphological problems. To mitigate the low-resource problem, source-side monolingual data are also used. Thus, this work investigates how to improve the Myanmar-to-English neural machine translation system. The experimental results show that the syllable-to-word level neural machine translation model obtains an improvement over the baseline systems.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
Natural Language Processing (NLP) techniques are among the most used techniques in the field of computer applications, and NLP has become a vast and advanced area. Language is the means of communication among humans, and in the present scenario, when everything depends on machines and everything is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP has emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics, which was passed through the Turing test to check the similarity between data, but was limited to small data sets. Later on, various algorithms were developed along with the concept of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques that have been developed to date, their applications, and the comparison of all those techniques on different parameters.
Machine Learning Approach for Language Identification & Transliteration: Shared Task Report of IITP-TS
Deepak Kumar Gupta
Comp. Sc. & Engg. Deptt.
IIT Patna, India
deepak.mtmc13@iitp.ac.in
Shubham Kumar
Comp. Sc. & Engg. Deptt.
IIT Patna, India
shubham.ee12@iitp.ac.in
Asif Ekbal
Comp. Sc. & Engg. Deptt.
IIT Patna, India
asif@iitp.ac.in
ABSTRACT
In this paper, we describe the system that we developed as part of our participation in the FIRE-2014 Shared Task on Transliterated Search. We participated only in Subtask 1, which focused on labeling the query words. The entire process consists of the following subtasks: language identification of each word in the text, named entity recognition and classification (NERC), and transliteration of the Indian language words written in non-native scripts to the corresponding native Indic scripts. The proposed methods for language identification and NERC are based on supervised approaches, where we use several machine learning algorithms. We develop a transliteration framework based on the modified joint source channel model. Experiments on the benchmark setup show that we achieve quite encouraging performance for both pairs of languages. It is also to be noted that we did not make use of any deep domain-specific resources and/or tools, and therefore the system can be easily adapted to other domains and/or languages.
Keywords
Language Identification, NERC, Transliteration, Ensemble, Modified Joint-Source Channel Model
1. INTRODUCTION
The recent decade has seen an upsurge in the social networking and e-commerce sectors, witnessing an enormous growth in the volume of data flowing out of media networks, which can be used by public and private organizations alike to gain valuable insights. New forms of communication, such as micro-blogging, Tweets, status updates, reviews and text messaging, have emerged and become ubiquitous. These messages are often written in the Roman script due to various socio-cultural and technological reasons [4]. Many languages, such as the South and South East Asian languages, Arabic, Russian etc., use indigenous scripts in written form. The process of phonetically representing the words of a language in a non-native script is called transliteration. Transliteration, especially into the Roman script, is used abundantly on the Web, not only for documents but also for user queries that intend to search for these documents. These problems were addressed in the FIRE-2013 Shared Task on Transliteration [5]. More recent studies show that building computational models for social media content is more challenging because of the nature of the language mixing as well as the presence of non-standard variations in spelling, grammar and transliteration [1]. The work that we present here is connected to the shared task conducted as a continuation of the previous year's.
1.1 Task Description
This year, two subtasks on Transliterated Search were conducted: the first is query labeling, and the second is an ad-hoc retrieval task for Hindi film lyrics. We participated in the first task, which is briefly described as follows:
Subtask 1: Query Word Labeling
Suppose that a query q: w1 w2 w3 . . . wn is written in the Roman script. The words w1, w2, etc. could be standard English words or transliterations from another language L. The task is to label each word as E or L depending on whether it is an English word or a transliterated L-language word, and then, for each transliterated word, to provide the correct transliteration in the native script (i.e., the script used for writing L). The task also required identifying and classifying named entities of the types person, location, organization and abbreviation.
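As a constructed illustration (ours, not taken from the task data), a mixed Hindi-English query such as "dil to pagal hai movie" would be labeled, schematically, as dil/H to/H pagal/H hai/H movie/E, and each H word would additionally be transliterated into Devanagari (e.g., dil → दिल).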
2. METHODOLOGY
The overall task of query labeling consists of three major components, viz. language identification, NERC and transliteration. It is to be noted that we did not make use of any domain-specific resources and/or tools, for the sake of domain independence. Below we describe the methodologies that we followed for each of these individual modules.
2.1 Language Identification
The problem of language identification concerns determining the language of a given word. The task can be modeled as a classification problem, where each word has to be labeled with one of three classes, namely Hindi (or Bengali), English and Mixed (denoting mixed characters of English and non-Latin language scripts). Our proposed method for language identification is supervised in nature. In particular, we develop systems based on four different classifiers, namely random forest, random tree, support vector machine and decision tree. We use the Weka implementations (http://www.cs.waikato.ac.nz/ml/weka/) for these classifiers. In order to further improve the performance, we construct an ensemble by combining the decisions of all the classifiers using majority voting. We followed the same approach for both the Hindi-English and Bangla-English pairs. The features that we implemented for language identification are described below in brief; an illustrative sketch of the feature extraction and voting ensemble follows the list.
1. Character n-gram: A character n-gram is a contiguous sequence of n characters extracted from a given word. We extract character n-grams of length one (unigram), two (bigram) and three (trigram), and use these as features of the classifiers.

2. Context word: Local context helps to identify the type of the current word. We use the previous two and next two words as context features.

3. Word normalization: Words are normalized in order to capture the similarity between two different words that share some common properties. Each capitalized letter is replaced by 'A', each lowercase letter by 'a' and each digit by '0'.

4. Gazetteer based feature: We compile lists of Hindi, Bengali and English words from the training datasets. A feature vector of length two is defined, one position per gazetteer of the language pair (Hindi-English or Bangla-English). For each token we set the feature value to '1' if the token is present in the respective gazetteer, and to '0' otherwise. Hence, for words that appear in both gazetteers, the feature vector takes the value 1 in both bit positions. Recent studies also suggest that gazetteer based features can be used effectively for language identification [3].

5. InitCap: This feature checks whether the current token starts with a capital letter.

6. InitPunDigit: We define a binary-valued feature that checks whether the current token starts with a punctuation mark or a digit.

7. DigitAlpha: We define this feature to check whether the current token is alphanumeric.

8. Contains# symbol: We define a feature that checks whether the word in the surrounding context contains the symbol #.

The last three features help to recognize tokens which are mixed in nature (i.e., do not belong to Hindi, English or Bangla). Some examples are: 2mar, #lol, (rozana etc.
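Below is a minimal sketch, in Python with scikit-learn, of how such word-level features and a majority-voting ensemble over the four classifier types could be assembled. This is our own illustration, not the authors' Weka pipeline: the function and feature names are hypothetical, ExtraTreeClassifier stands in for Weka's random tree, and the context-word features (which need neighboring tokens) are omitted for brevity.

import re

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

def word_features(word, hi_gazetteer=frozenset(), en_gazetteer=frozenset()):
    """Per-word features mirroring the list above (context words omitted)."""
    feats = {}
    for n in (1, 2, 3):                                     # character n-grams
        for i in range(len(word) - n + 1):
            feats["ng%d:%s" % (n, word[i:i + n])] = 1
    norm = re.sub(r"[A-Z]", "A", word)                      # word normalization
    norm = re.sub(r"[a-z]", "a", norm)
    norm = re.sub(r"[0-9]", "0", norm)
    feats["norm:" + norm] = 1
    feats["in_hi_gaz"] = int(word.lower() in hi_gazetteer)  # gazetteer bits
    feats["in_en_gaz"] = int(word.lower() in en_gazetteer)
    feats["init_cap"] = int(word[:1].isupper())             # InitCap
    feats["init_pun_digit"] = int(bool(re.match(r"[\W\d]", word)))  # InitPunDigit
    feats["digit_alpha"] = int(any(c.isdigit() for c in word)
                               and any(c.isalpha() for c in word))  # DigitAlpha
    feats["has_hash"] = int("#" in word)                    # Contains #
    return feats

# Hard majority voting over the four classifier types named above.
ensemble = make_pipeline(
    DictVectorizer(),
    VotingClassifier(
        estimators=[("rf", RandomForestClassifier()),
                    ("rt", ExtraTreeClassifier()),
                    ("svm", SVC()),
                    ("dt", DecisionTreeClassifier())],
        voting="hard"))

# Usage, given training words and their H/E/Mixed labels:
#   ensemble.fit([word_features(w) for w in train_words], train_labels)
#   ensemble.predict([word_features(w) for w in test_words])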
2.2 Named Entity Recognition and Classification
Named Entity Recognition and Classification (NERC) in unstructured texts such as Facebook posts, blogs etc. is more challenging compared to the traditional news-wire domains. Here the task was to identify named entities (NEs) and classify them into the following categories: Person, Organization, Location and Abbreviation. We use a machine learning model to recognize the first three NE types, and for the last one we use heuristics. It is to be noted that there were many inconsistencies in the annotation, and hence we pre-processed the datasets to maintain uniformity. In order to denote the boundaries of NEs we use the BIO encoding scheme (B, I and O denote the beginning of, the inside of, and the outside of NEs, respectively). We implement the following features for NERC; a feature-extraction sketch follows the list.
1. Local context: Local contexts that span the preced-
ing and following few tokens of the current word are
used as the features. Here we use the previous two and
next two tokens as the features.
2. Character n-gram: Similar to the language identifi-
cation we use n-grams of length upto 5 as the features.
3. Prefix and Suffix: Prefix and suffix of fixed length
character sequences (here, 3) are stripped from each
token and used as the features of classifier.
4. Word normalization: This feature is defined exactly
in the same way as we did for language identification.
5. WordClassFeature: This feature was defined to en-
sure that the words having similar structures belong
to the same class. In the first step we normalize all
the words following the process as mentioned above.
Thereafter, consecutive same characters are squeezed
into a single character. For example, the normalized
word AAAaaa is converted to Aa. We found this fea-
ture to be effective for the biomedical domain, and we
directly adapted this without any modification. De-
tailed sensitivity analysis might be useful to study its
effectiveness for the current domain.
6. Typographic features: We define a set of features depending upon the typographic construction of the words. We implement the following four features: AllCaps (whether the current word is made up of all capitalized letters), AllSmall (the word consists only of uncapitalized characters), InitCap (the word starts with a capital letter) and DigitAlpha (the word contains digits and alphabets).
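As referenced in the WordClassFeature description above, the following minimal Python sketch shows the normalize-then-squeeze computation; this is our illustration, and the function names normalize and word_class are hypothetical.

    import re

    def normalize(token):
        """Word normalization as described above: capital letters become
        'A', small letters 'a' and digits '0'; other symbols are kept."""
        out = []
        for ch in token:
            if ch.isupper():
                out.append("A")
            elif ch.islower():
                out.append("a")
            elif ch.isdigit():
                out.append("0")
            else:
                out.append(ch)
        return "".join(out)

    def word_class(token):
        """WordClassFeature: normalize, then squeeze runs of identical
        characters into one, e.g. 'AAAaaa' -> 'Aa'."""
        return re.sub(r"(.)\1+", r"\1", normalize(token))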
2.3 Transliteration
A transliteration system takes as input a character string in
the source language and generates a character string in the
target language as output. The transliteration algorithm [2] that we use here is conceptualized as two levels of decoding: segmenting the source and target language strings into transliteration units (TUs), and defining appropriate mappings between the source and target TUs by resolving different combinations of alignments and unit mappings. The TUs are defined by means of regular expressions.
For K aligned TUs, where X = x_1 x_2 ... x_K is the source string and T = t_1 t_2 ... t_K the target string, we have

P(X, T) = P(x_1, x_2, ..., x_K, t_1, t_2, ..., t_K)
        = P(<x,t>_1, <x,t>_2, ..., <x,t>_K)
        = ∏_{k=1}^{K} P(<x,t>_k | <x,t>_1^{k-1})
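For concreteness, for K = 2 this chain-rule factorization reduces to P(X, T) = P(<x,t>_1) · P(<x,t>_2 | <x,t>_1): the first TU pair is generated without context, and the second is conditioned on the first.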
We implement a number of transliteration models that can generate the original Indian language word (i.e., in Indic script) from the given English transliteration written in Roman script. The Indic word is divided into TUs that follow the pattern C+M, where C represents a vowel, a consonant or a conjunct, and M represents a vowel modifier or matra. An English word is divided into TUs that follow the pattern C*V*, where C represents a consonant and V represents a vowel [2]; a regex sketch of this English-side segmentation is given after the model definitions below. The process considers contextual information on both the source and the target side in the form of collocated TUs, computes the probability of transliteration from each source TU to the various target candidate TUs, and chooses the one with maximum probability. The most appropriate mappings between the source and target TUs are learned automatically from the bilingual training corpus. The training process yields a kind of decision-tree list that outputs, for each source TU, a set of possible target TUs along with the probability of each decision. The transliteration of the input word is obtained through direct orthographic mapping, by identifying the equivalent target TU for each source TU of the input and then placing the target TUs in order. We implemented all six models proposed in [2]; based on experiments on the held-out datasets we selected the following three models to submit our runs:
Model-I: This is a kind of monogram model where no context is considered, i.e.,
P(X,T) = ∏_{k=1}^{K} P(<x,t>_k)
Model-II: This model is built by considering the next source TU as context:
P(X,T) = ∏_{k=1}^{K} P(<x,t>_k | x_{k+1})
Model-III: This model incorporates the previous and the next TUs on the source side and the previous target TU as the context:
P(X,T) = ∏_{k=1}^{K} P(<x,t>_k | <x,t>_{k-1}, x_{k+1})
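All three models presuppose the TU segmentation described earlier. As an illustration of the English-side C*V* pattern, a minimal regex-based segmenter might look as follows; this is our own reading of the scheme in [2], and the names VOWELS, TU_RE and english_tus are hypothetical.

    import re

    # Vowel set for the Romanized (English) side; the Indic-side C+M
    # segmentation would need the corresponding Unicode matra ranges.
    VOWELS = "aeiou"
    TU_RE = re.compile("[^" + VOWELS + "]*[" + VOWELS + "]+|[^" + VOWELS + "]+")

    def english_tus(word):
        """Segment a Romanized word into TUs: a (possibly empty) run of
        consonants followed by vowels, with a trailing all-consonant
        chunk kept as its own unit."""
        return TU_RE.findall(word.lower())

    # e.g. english_tus("rozana") -> ['ro', 'za', 'na']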
The overall transliteration process attempts to produce the best output for the given input word using Model-III. If no transliteration is obtained, we consult Model-II and then Model-I in sequence. If none of these models produces an output, we fall back on a literal transliteration model developed using a dictionary. This process is shown below:
Input: Token t, labeled as L in language identification, where L denotes Hindi (H) or Bengali (B)
Output: Transliteration T of the given token
• Step 1: T <- Model-III(t)
  1.1 If T is null
    1.1.1 T <- Model-III-withAlignment(t)
• Step 2: If T is null
  2.1 T <- Model-II(t)
  2.2 If T is null
    2.2.1 T <- Model-II-withAlignment(t)
• Step 3: If T is null
  3.1 T <- Model-I(t)
  3.2 If T is null
    3.2.1 T <- Model-I-withAlignment(t)
• Step 4: return T
Here each model takes the input token, divides it into several non-native TUs and outputs the native TU for each of them; each withAlignment variant additionally aligns the source TUs with the target TUs.
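The cascade above can be summarized in a few lines of Python. This is a minimal sketch of the back-off logic rather than the actual implementation; the names transliterate, models and literal are hypothetical.

    def transliterate(token, models, literal):
        """Back-off cascade mirroring Steps 1-4 above. `models` is an
        ordered list of (model, model_withAlignment) callables for
        Model-III, Model-II and Model-I; each returns a transliteration
        string or None. `literal` is the dictionary-based fallback."""
        for model, model_aligned in models:
            result = model(token)
            if result is None:
                result = model_aligned(token)
            if result is not None:
                return result
        # None of the statistical models produced an output
        return literal(token)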
3. EXPERIMENTS AND DISCUSSIONS
3.1 Datasets
We submitted runs for two language pairs, namely Hindi-English and Bangla-English. For language identification the FIRE 2014 organizers provided three documents for the Hindi-English pair and two documents for the Bangla-English pair. For each language pair, the individual documents were merged into a single file for training. The training sets consist of 1,004 sentences (20,658 tokens) and 800 sentences (27,969 tokens) for the Hindi-English and Bangla-English language pairs, respectively. The test sets consist of 32,270 and 25,698 tokens for Hindi-English and Bangla-English, respectively. For training the transliteration algorithm we use 54,791 Hindi-English and 19,582 Bangla-English parallel examples. Details of these datasets can be found in [6].
3.2 Results and Analysis
In this section we report the results obtained for query word labeling. We submitted three runs, defined as follows:
Run-1: For language identification and NERC, we construct ensembles using majority voting. If a token is labeled as a native language (Hindi or Bangla), we perform transliteration for that token.
Run-2: In this run we perform language identification by the majority ensemble, and NERC by SMO. Words labeled as native language are transliterated accordingly.
Run-3: In this run both language identification and NERC are carried out using SMO. Transliteration is done following the same method.
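The majority-voting ensemble used in Run-1 and Run-2 can be sketched as follows; this is our illustration, and the function name majority_vote is hypothetical.

    from collections import Counter

    def majority_vote(predictions):
        """Combine per-token label sequences from several classifiers by
        simple majority voting; ties go to the label seen first.
        `predictions` is a list of equally long label sequences."""
        combined = []
        for labels in zip(*predictions):  # the classifiers' labels for one token
            combined.append(Counter(labels).most_common(1)[0][0])
        return combined

    # e.g. majority_vote([["H", "E"], ["H", "E"], ["E", "H"]]) -> ['H', 'E']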
Run ID LP LR LF EP ER EF LA
Run-1 0.920 0.843 0.880 0.883 0.932 0.907 0.886
Run-2 0.922 0.843 0.881 0.884 0.931 0.907 0.886
Run-3 0.882 0.841 0.861 0.88 0.896 0.888 0.870
Table 1: Results for language identification of Bangla-English. Here, LP-Language precision, LR-Language recall, LF-Language F-score, EP-English precision, ER-English recall, EF-English F-score, LA-Labelling accuracy
Run ID LP LR LF EP ER EF LA
Run-1 0.921 0.895 0.908 0.89 0.908 0.899 0.879
Run-2 0.921 0.893 0.907 0.89 0.908 0.899 0.878
Run-3 0.905 0.865 0.885 0.86 0.886 0.873 0.857
Table 2: Results for language identification of Hindi-
English
Run ID EQMF-ALL TP TR TF ETPM
Run-1 0.005 0.039 0.574 0.073 228/337
Run-2 0.005 0.039 0.574 0.073 228/337
Run-3 0.005 0.038 0.582 0.071 231/344
Table 3: Results of transliteration for Bangla-English. Here, EQMF-ALL-Exact query match fraction (all), TP-Transliteration precision, TR-Transliteration recall, TF-Transliteration F-score, ETPM-Exact transliteration pair match
The systems were evaluated using the metrics defined in [5]. Overall results for language identification are reported in Table 1 and Table 2 for the Bangla-English and Hindi-English pairs, respectively. Evaluation shows that the first two runs achieve almost the same F-scores. Experimental results for transliteration are reported in Table 3 and Table 4 for Bangla-English and Hindi-English, respectively. Comparisons with the other submitted systems show that our performance levels are at the upper end. For Hindi we obtain precision, recall and F-score of 92.10%, 89.50% and 90.80%, respectively, which is very close to the best performing system (inferior by less than one F-score point). For English, our system attains precision, recall and F-score values of 89.00%, 90.80% and 89.90%, respectively; this is lower than the best system by only 0.2 F-score points. For the Bangla-English pair our system also performs with impressive F-score values. In a separate experiment we evaluated the NERC model. We obtain precision, recall and F-score values of 61.00%, 44.25% and 51.25%, respectively, for the Hindi-English pair using the ensemble framework. For the Bangla-English pair it yields precision, recall and F-score values of 54.25%, 43.25% and 48.25%, respectively.
A close investigation of the results shows that our language identification model suffers due to very short forms of words, ambiguities, erroneous words and mixed wordforms. Short words refer to tokens written in their shortened forms. Ambiguities arise from words that have meanings in both the native and the non-native script. Errors are encountered because of wrong spellings. The alphanumeric (i.e., mixed) wordforms also often contribute to the overall errors. Examples of these errors are shown in Table 5. The transliteration model makes most of its errors because of spelling variation (e.g., Pahalaa [ vs. ]). Inconsistent annotation is another potential source of errors.
Run ID EQMF-ALL TP TR TF ETPM
Run-1 0.005 0.146 0.76 0.244 1933/2306
Run-2 0.004 0.146 0.76 0.244 1931/2301
Run-3 0.004 0.143 0.736 0.24 1871/2226
Table 4: Results of transliteration for Hindi-English
Type Words Predicted Reference
Short words thrgh H E
Ambiguous words the;ate E;E H;H
Erroneous words implemnt H E
Mixed numeral words 2mar O B
Table 5: Language labeling errors. Here, H-Hindi, E-English, B-Bangla, O-others
4. CONCLUSION
In this paper we presented a brief overview of the system that we developed as part of our participation in the query labeling subtask of transliterated search. The proposed method classifies the input query words according to their native language and back-transliterates non-English words to their original script. We used several classification techniques for solving the problems of language identification and NERC. For transliteration we used a modified joint source-channel model. We submitted three runs. Comparisons with the other systems show that our system achieves quite encouraging performance for both language pairs, Hindi-English and Bangla-English.
Our detailed analysis suggests that the language identification module suffers most due to the presence of very short wordforms, ambiguities and alphanumeric words. Errors encountered in the transliteration model could be reduced considerably by developing a method to handle spelling variations.
5. REFERENCES
[1] K. Bali, J. Sharma, M. Choudhury, and Y. Vyas. "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In First Workshop on Computational Approaches to Code Switching, EMNLP 2014, page 116, 2014.
[2] A. Ekbal, S. K. Naskar, and S. Bandyopadhyay. A modified joint source-channel model for transliteration. In Proceedings of the COLING/ACL, pages 191–198, 2006.
[3] C.-C. Lin, W. Ammar, L. Levin, and C. Dyer. The CMU submission for the shared task on language identification in code-switched data. page 80, 2014.
[4] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal. Challenges in designing input method editors for Indian languages: The role of word-origin and context. In Advances in Text Input Methods (WTIM 2011), pages 1–9, 2011.
[5] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal. Overview and datasets of FIRE 2013 track on transliterated search. In FIRE-13 Workshop, 2013.
[6] S. V.B., M. Choudhury, K. Bali, T. Dasgupta, and A. Basu. Resource creation for training and testing of