This document discusses how using different levels of Arabic morphological knowledge can improve search results in an Arabic search engine. It proposes analyzing user search queries at multiple morphological levels to generate related terms for the search. The levels range from an exact match to broader matches incorporating verb conjugations, noun cases, and ultimately matching on the word root. This approach aims to balance recall and precision by starting with a narrow exact-match search and expanding systematically based on morphological relationships.
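The tiered expansion described above can be sketched as follows; the variant table and romanized forms are hypothetical stand-ins for the output of a real Arabic morphological analyzer:

```python
# Sketch of tiered query expansion by morphological level.
# Level 0 matches the surface form only; each later level adds broader
# morphological variants, trading precision for recall.

# A toy variant table for one query term; a real system would derive
# these from an Arabic morphological analyzer, not a hand-built dict.
VARIANTS = {
    "kataba": {
        1: ["yaktubu", "katabat"],    # verb conjugations
        2: ["kitabun", "kitaban"],    # noun case forms
        3: ["maktab", "kitab"],       # words sharing the root k-t-b
    }
}

def expand_query(term, level):
    """Return search terms for `term` up to the given morphological level."""
    terms = [term]  # level 0: exact match
    for lvl in range(1, level + 1):
        terms.extend(VARIANTS.get(term, {}).get(lvl, []))
    return terms
```

Searching starts at level 0 and widens only if too few results come back, which is the recall/precision trade-off the document describes.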
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMER (ijnlc)
The document describes a deep BiGRU-CNN-CRF neural network model that jointly performs word segmentation, stemming, and named entity recognition for the Myanmar language. The model uses character-level and syllable-level representations as input. It was trained on a manually annotated Myanmar corpus. Evaluation results showed the neural sequence labeling architecture achieved state-of-the-art performance for the joint tasks of word segmentation, stemming, and named entity recognition in Myanmar text.
This document summarizes research that has been done on computational morphology for the Odia language. It begins with an abstract that outlines how morphological analysis, generation, and parsing are important tools for natural language processing. The document then reviews different works that have developed morphological analyzers and generators for Odia. It describes various methods that have been used, including suffix stripping, finite state transducers, two-level morphology, corpus-based approaches, and paradigm-based approaches. Finally, it outlines several applications of morphology like machine translation, spelling checking, and part-of-speech tagging.
Corpus-based part-of-speech disambiguation of Persian (IDES Editor)
In this paper we introduce a method for part-of-speech disambiguation of Persian texts, which uses word-class probabilities in a relatively small training corpus in order to automatically tag unrestricted Persian texts. The experiment has been carried out at two levels: unigram and bigram disambiguation. Comparing the results gained from the two levels, we show that using the immediate right context to which a given word belongs can increase the accuracy of the system to a high degree.
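The unigram-versus-bigram contrast can be illustrated with a toy count-based tagger; the corpus, words and tags below are invented examples (not the paper's Persian data), and raw counts stand in for the estimated probabilities:

```python
from collections import defaultdict

# Toy corpus of (word, tag) pairs standing in for the training corpus.
corpus = [("the", "DET"), ("can", "N"), ("can", "V"), ("rust", "V"),
          ("the", "DET"), ("can", "N"), ("rusts", "V")]

word_tag = defaultdict(lambda: defaultdict(int))   # count(word, tag)
tag_bigram = defaultdict(lambda: defaultdict(int)) # count(prev_tag, tag)
prev = "<s>"
for w, t in corpus:
    word_tag[w][t] += 1
    tag_bigram[prev][t] += 1
    prev = t

def tag_word(word, prev_tag):
    """Pick the tag maximizing count(word, tag) * (1 + count(prev_tag, tag)).

    With the bigram factor removed this degenerates to the unigram level."""
    candidates = word_tag[word]
    return max(candidates,
               key=lambda t: candidates[t] * (1 + tag_bigram[prev_tag][t]))
```

The context factor is what lets the ambiguous word "can" resolve differently after a determiner than after a noun, mirroring the paper's finding that immediate context raises accuracy.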
The document discusses information retrieval based on word semantics in Arabic texts. It covers several key areas: (1) the challenges of natural language processing in Arabic due to its rich morphology; (2) the process of morphological analysis including preprocessing, stemming, and indexing terms; (3) the research problems of synonymy and polysemy in information retrieval; and (4) semantic approaches to address these problems including automatic discovery of similar words and synonym-based search methods.
Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web... (Svetlin Nakov)
Scientific paper: False friends are pairs of words in two languages that are perceived as similar, but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs, used as "bridges". Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously proposed algorithms.
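A minimal sketch of the first idea, co-occurrence statistics over a sentence-aligned bi-text, using the Dice coefficient; the word sets below are toy stand-ins, not the paper's data. A similar-looking pair with a low score is a false-friend candidate:

```python
# Each entry is one aligned sentence pair, reduced to its word sets
# (toy German / English examples).
bitext = [
    ({"gift", "kaufen"},  {"poison", "buy"}),
    ({"gift", "haus"},    {"poison", "house"}),
    ({"present", "haus"}, {"gift", "house"}),
]

def dice(w1, w2):
    """Dice coefficient of w1 (language 1) and w2 (language 2)
    over the aligned sentence pairs."""
    n1 = sum(w1 in s1 for s1, _ in bitext)
    n2 = sum(w2 in s2 for _, s2 in bitext)
    n12 = sum(w1 in s1 and w2 in s2 for s1, s2 in bitext)
    return 2 * n12 / (n1 + n2) if n1 + n2 else 0.0
```

Here German "gift" co-occurs with English "poison" but never with English "gift", so the orthographically identical pair gets a score of zero and is flagged as a likely false friend.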
Automatic Identification of False Friends in Parallel Corpora: Statistical an... (Svetlin Nakov)
Scientific article: False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words, based on using the Web as a corpus through analyzing the words' local contexts extracted from the text snippets returned by searching in Google. The statistical and semantic measures are further combined into an improved algorithm for identification of false friends that achieves almost twice the accuracy of previously known algorithms. The evaluation is performed for identifying cognates between Bulgarian and Russian, but the proposed methods could be adopted for other language pairs for which parallel corpora and bilingual glossaries are available.
The document discusses the fields of syntax, semantics, and pragmatics. It states that syntax studies linguistic forms and structures, semantics studies the relationship between linguistic forms and real-world entities, and pragmatics studies the relationship between linguistic forms and their users. It also explains that pragmatics developed to study aspects of language use that were not easily addressed by formal systems focused on syntax and semantics. Pragmatics examines how people understand meaning based on context and shared experiences.
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ... (kevig)
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime or game characters. To overcome this challenge, we propose segmenting the lines of Japanese anime or game characters using subword units that were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF-IDF according to gender, age, and each anime character, and show that they are linguistic speech patterns specific to each feature. Additionally, a classification experiment shows that the model with subword units outperformed that with the conventional method.
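The TF-IDF weighting of per-character units can be sketched as below; the "utterances" are invented toy tokens standing in for subword units from a BPE-style segmenter:

```python
import math
from collections import Counter

# Toy per-character token collections; real input would be subword units
# produced by a BPE-style segmenter rather than whitespace tokens.
docs = {
    "char_A": "nya nya desu desu nya".split(),
    "char_B": "da ze da ze desu".split(),
}

def tfidf(character):
    """TF-IDF score of each unit in one character's utterances,
    treating each character's collection as one document."""
    tf = Counter(docs[character])
    n_docs = len(docs)
    total = len(docs[character])
    scores = {}
    for unit, count in tf.items():
        df = sum(unit in d for d in docs.values())
        scores[unit] = (count / total) * math.log(n_docs / df)
    return scores
```

Units used by only one character (like "nya" here) score high and surface as that character's distinctive speech pattern, while units shared across all characters score zero.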
Natural Language Generation from First-Order Expressions (Thomas Mathew)
In this paper I discuss an approach for generating natural language from a first-order logic representation. The approach shows that a grammar definition for the natural language and a lambda-calculus-based semantic rule collection can be applied for bi-directional translation using an overgenerate-and-prune mechanism.
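The overgenerate-and-prune idea can be sketched minimally as below; the grammar entries and logical forms are invented toys, with surface/semantics pairs listed directly rather than composed via lambda calculus:

```python
# Toy grammar: (surface string, composed first-order semantics) pairs.
# A real system would derive the semantics compositionally from
# lambda-calculus rules attached to grammar productions.
GRAMMAR = [
    ("John walks", "walk(john)"),
    ("John runs",  "run(john)"),
    ("Mary walks", "walk(mary)"),
]

def generate(target_logic):
    """Overgenerate all surface strings the grammar licenses,
    then prune those whose semantics do not match the target."""
    return [surface for surface, sem in GRAMMAR if sem == target_logic]
```

The same pairing runs in the other direction (surface string to logic), which is the bi-directionality the paper refers to.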
Exploiting rules for resolving ambiguity in Marathi language text (eSAT Journals)
Abstract
Natural language ambiguity is a situation in which a word has multiple meanings or senses. This paper discusses natural language ambiguity and its types. Further, we propose a knowledge-based solution to resolve various types of ambiguity occurring in Marathi language text. The task of resolving the semantic and lexical ambiguity of words to obtain their actual sense is referred to as Word Sense Disambiguation (WSD). Marathi is the official and commonly spoken language of Maharashtra state in India. Plenty of words in Marathi are spelled the same and uttered the same but are semantically (meaning-wise/sense-wise) different. During automatic translation, these words lead to ambiguity. Our method successfully removes the ambiguity by identifying the correct sense of the given text from the predefined possible senses available in Marathi Wordnet using word and sentence rules. The method is applicable only to word-level ambiguity; structural ambiguity is not handled by this system. This system may be successfully used as a subsystem in other Natural Language Processing (NLP) applications.
Key Words: Word Sense Disambiguation, Natural Language Processing, Marathi, Marathi Wordnet, ambiguity, knowledge based
Abstract
Part of speech tagging plays an important role in developing natural language processing software. Part of speech tagging means assigning a part of speech tag to each word of a sentence: the tagger takes a sentence as input and assigns the appropriate part of speech tag to each word. In this article I survey the different works that have been done on Odia POS tagging.
________________________________________________
Author Credits - Maaz Nomani
A Proposition Bank is a collection of sentences which are hand-annotated with the information of semantic labels in the respective sentences. Currently, around 10,000 sentences containing 0.2 million words have been hand-annotated with the semantic labels information.
This is a natural language resource of very rich linguistic information which can be used in a variety of NLP applications such as semantic parsing, syntactic parsing, sentiment analysis, dialogue systems etc.
In this paper, we present one such resource for Urdu, a resource-poor Indian language. The Proposition Bank of Urdu is built on the existing Urdu Treebank (a Treebank is a corpus of sentences annotated with POS, morphological, head, TAM and dependency label information). A Propbank adds a layer of semantic information over this Treebank and hence can facilitate semantic parsing and other semantic-level operations on natural language sentences.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT... (kevig)
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have delimiters between words. Hiragana is a type of Japanese phonogramic character used in texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because less information is available for word division. For morphological analysis of Hiragana sentences, we demonstrate the effectiveness of fine-tuning using a model based on ordinary Japanese text and examine the influence of training data drawn from texts of various genres.
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
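As a small illustration of the tokenization step mentioned above, here is a minimal regex-based tokenizer that separates words from punctuation (a sketch, not any particular toolkit's implementation):

```python
import re

def tokenize(text):
    """Break text into discrete word and punctuation tokens:
    runs of word characters, or single non-space non-word characters."""
    return re.findall(r"\w+|[^\w\s]", text)
```

Real tokenizers handle contractions, numbers and language-specific rules, but the split into words and punctuation is the core operation the overview describes.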
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC... (kevig)
In this paper, phoneme sequences are used as language information to perform code-switched language identification (LID). With the one-pass recognition system, the spoken sounds are converted into phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity among our target languages, we report two methods of phoneme mapping. Statistical phoneme-based bigram language models (LMs) are integrated into speech decoding to eliminate possible phone mismatches. A supervised support vector machine (SVM) is used to learn to recognize the phonetic information of mixed-language speech based on recognized phone sequences. As the back-end decision is taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to classify language identity. The speech corpus was tested on Sepedi and English, languages that are often mixed. Our system is evaluated by measuring the ASR performance and the LID performance separately. The systems obtained a promising ASR accuracy with the data-driven phone-merging approach modelled using 16 Gaussian mixtures per state. The proposed systems achieved acceptable ASR and LID accuracy on code-switched speech and monolingual speech segments, respectively.
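The phone-bigram LM scoring step can be sketched as follows; the phone inventories and probabilities are invented toys, and smoothing is reduced to a tiny floor value:

```python
import math

# Hypothetical phone-bigram probabilities per language; a real system
# would estimate these from transcribed speech.
BIGRAM_LM = {
    "sepedi":  {("l", "e"): 0.4, ("e", "k"): 0.3},
    "english": {("th", "e"): 0.5, ("e", "k"): 0.1},
}

def identify_language(phones):
    """Score a recognized phone sequence under each language's bigram LM
    and return the language with the highest log-likelihood."""
    best, best_score = None, float("-inf")
    for lang, lm in BIGRAM_LM.items():
        score = sum(math.log(lm.get(bg, 1e-6))       # floor for unseen bigrams
                    for bg in zip(phones, phones[1:]))
        if score > best_score:
            best, best_score = lang, score
    return best
```

In the paper this per-segment evidence feeds an SVM back-end rather than a hard argmax, but the likelihood comparison is the same.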
Word sense disambiguation using WSD-specific WordNet of polysemy words (ijnlc)
This paper presents a new model of WordNet that is used to disambiguate the correct sense of a polysemy word based on clue words. The related words for each sense of a polysemy word, as well as of a single-sense word, are referred to as the clue words. The conventional WordNet organizes nouns, verbs, adjectives and adverbs together into sets of synonyms called synsets, each expressing a different concept. In contrast to the structure of WordNet, we developed a new model of WordNet that organizes the different senses of polysemy words, as well as the single-sense words, based on the clue words. These clue words for each sense of a polysemy word, as well as for a single-sense word, are used to disambiguate the correct meaning of the polysemy word in the given context using knowledge-based Word Sense Disambiguation (WSD) algorithms. The clue word can be a noun, verb, adjective or adverb.
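The clue-word idea can be sketched with a toy lexicon; the senses and clue words below are hypothetical examples, not actual WordNet data:

```python
# Hypothetical clue-word entries for one polysemous word.
CLUE_WORDS = {
    "bank": {
        "finance": {"money", "loan", "deposit", "account"},
        "river":   {"water", "shore", "fish", "flood"},
    }
}

def disambiguate(word, context):
    """Return the sense whose clue words overlap the context words most."""
    senses = CLUE_WORDS[word]
    return max(senses, key=lambda s: len(senses[s] & set(context)))
```

This is the knowledge-based overlap strategy in its simplest form: each sense competes on how many of its clue words appear in the surrounding context.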
Cross-lingual similarity discrimination with translation characteristics (ijaia)
This document summarizes a research paper on cross-lingual similarity discrimination using translation characteristics. The paper proposes a discriminative model trained on bilingual corpora to classify sentences in a target language as similar or dissimilar to a given sentence in a source language. Features used in the model include translation characteristics like sentence length ratios, word alignments, and polarity. The model is trained on various sampling methods to address the imbalanced data of having many more negative samples than positive translations. Experiments on 1500 English-Chinese sentence pairs show the model achieves satisfactory performance according to three evaluation metrics, outperforming a baseline system.
Robust extended tokenization framework for Romanian by semantic parallel text... (ijnlc)
Tokenization is considered a solved problem when reduced to just word-border identification, punctuation, and white-space handling. Obtaining a high-quality outcome from this process is essential for subsequent NLP pipeline processes (POS tagging, WSD). In this paper we claim that to obtain this quality we need to use in the tokenization disambiguation process all linguistic, morphosyntactic, and semantic-level word-related information as necessary. We also claim that semantic disambiguation performs much better in a bilingual context than in a monolingual one. We then show that for disambiguation purposes the bilingual text provided by high-profile online machine translation services performs almost to the same level as human-originated parallel texts (the gold standard). Finally, we claim that the tokenization algorithm incorporated in TORO can be used as a criterion for the comparative quality assessment of online machine translation services, and we provide a setup for this purpose.
IRJET - Short-Text Semantic Similarity using GloVe Word Embedding (IRJET Journal)
The document describes a study that uses GloVe word embeddings to measure semantic similarity between short texts. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The study trains GloVe word embeddings on a large corpus, then uses the embeddings to encode short texts and calculate their semantic similarity, comparing the accuracy to methods that use Word2Vec embeddings. It aims to show that GloVe embeddings may provide better performance for short text semantic similarity tasks.
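A sketch of the averaged-embedding similarity the study describes; the 3-dimensional vectors below are toy stand-ins for real GloVe vectors, which are typically 50 to 300 dimensions:

```python
import math

# Toy word vectors standing in for trained GloVe embeddings.
TOY_VECS = {
    "cat": [0.9, 0.1, 0.0], "dog": [0.8, 0.2, 0.0],
    "stock": [0.0, 0.1, 0.9], "market": [0.1, 0.0, 0.8],
}

def sentence_vec(words):
    """Encode a short text as the mean of its word vectors."""
    vecs = [TOY_VECS[w] for w in words if w in TOY_VECS]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity(text1, text2):
    """Cosine similarity between averaged embeddings of two short texts."""
    return cosine(sentence_vec(text1.split()), sentence_vec(text2.split()))
```

Averaging is the simplest way to compose word vectors into a text vector; the study's comparison against Word2Vec swaps the embedding table, not this scoring step.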
This document discusses the basic tasks involved in natural language processing (NLP). It describes the different phases of NLP including phonetics, lexical analysis, syntactic analysis, semantic analysis, discourse analysis, and pragmatic analysis. It then explains some basic NLP activities like tokenization, sentence splitting, and part-of-speech tagging. The goal of NLP is to enable computers to understand and process human languages through computational modeling.
Hybrid part-of-speech tagger for non-vocalized Arabic text (ijnlc)
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
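The rules-first, statistics-fallback control flow can be sketched as below; the prefix rule and frequency table are invented stand-ins for the paper's Arabic rule set and HMM tagger:

```python
# Hypothetical components: a deterministic rule table and a most-likely-tag
# table standing in for the trained Hidden Markov Model.
RULES = {"al-": "NOUN"}          # toy prefix rule (definite article)
HMM_BEST_TAG = {"kataba": "VERB"}

def hybrid_tag(word):
    """Apply rules first; fall back to the statistical tagger for words
    the rules leave untagged."""
    for prefix, tag in RULES.items():
        if word.startswith(prefix):
            return tag
    return HMM_BEST_TAG.get(word, "UNK")
```

The hybrid's gain in the paper comes from exactly this division of labor: rules handle the cases they cover reliably, and the statistical model covers the words rules misclassify or leave untagged.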
Corpus-Based Vocabulary Learning in Technical English (CSCJournals)
One of the main challenges facing English for Specific Purposes (ESP) teachers in developing syllabi at the higher education level is the choice of vocabulary to be taught. This issue is particularly prominent in technical English, which, apart from being abundant in nouns, requires students to learn other highly frequent noun-based structures such as multi-word lexical units (MWLUs). Learning how to cope with these condensed structures, both in reading and writing, will make students competent and self-confident ESP users. The choice of lexical items is usually left to teachers' intuition. This paper intends to help teachers avoid such intuitive decisions, offering instead a model for incorporating corpus-based vocabulary findings into their ESP syllabi. Thus, the research questions addressed in this paper are: Can computer software extract all the nouns, MWLUs and multi-noun lexical units (MNLUs) with 100% certainty? What is their precise number in the pedagogical corpus of English for Traffic and Transport Purposes (ETTP)? The paper approaches the issue from an analytical point of view, i.e. by instructing the teacher-researcher step by step in analysing pedagogical specialized corpora, including the possible problems they might encounter using irreplaceable, yet not completely accurate, computer software for the purpose. The paper proposes original solutions to overcome the encountered imperfections in order to obtain accurate, evidence-based lists of the most frequent nouns, MWLUs and MNLUs, making some extra effort by manually complementing the computer-based analysis. By applying the methodology proposed in this paper, teacher-researchers will no longer have to wonder which nouns, MWLUs and MNLUs to teach, since they can create accurate lists by themselves.
Furthermore, a specialized pedagogical corpus analysis provides a valuable basis for creating glossaries and specialized minimum dictionaries, serving as a source for creating syllabi and lexis-oriented exercises, as well as for designing language tests, all with the ultimate aim of improving students' lexical competencies in a specific field of study.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION (kevig)
Phonetic typing using the English alphabet has become widely popular nowadays for social media and chat services. As a result, texts containing various English and Bangla words and phrases have become increasingly common. Existing transliteration tools display poor performance on such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetically typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
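A dictionary-then-rules-then-pass-through flow in the spirit of such a hybrid might look like this sketch; all mappings are hypothetical toy examples, not the actual THT resources:

```python
# Stage 1: known phonetically typed words (hypothetical entries).
DICT = {"ami": "আমি"}
# Stage 2: ordered character rules, longest match first (hypothetical).
RULES = [("kh", "খ"), ("k", "ক"), ("a", "া")]

def transliterate(word):
    if word in DICT:                       # stage 1: dictionary lookup
        return DICT[word]
    out, i = "", 0
    while i < len(word):                   # stage 2: apply character rules
        for src, dst in RULES:
            if word.startswith(src, i):
                out += dst
                i += len(src)
                break
        else:                              # stage 3: pass unknowns through
            out += word[i]
            i += 1
    return out
```

Ordering the rules longest-first ("kh" before "k") is what keeps digraphs from being split, a standard concern in rule-based transliteration.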
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMA... (kevig)
This paper refers to the syntactic analysis of phrases in Romanian, as an important process of natural language processing. We suggest a real-time solution based on the idea of using some words or groups of words that indicate grammatical category, and some specific endings of certain parts of a sentence. Our idea is based on characteristics of the Romanian language, where some prepositions, adverbs or specific endings can provide a lot of information about the structure of a complex sentence. Such characteristics can be found in other languages too, such as French. Using a special grammar, we developed a system (DIASEXP) that can perform a dialogue in natural language with assertive and interrogative sentences about a "story" (a set of sentences describing events from real life).
Author Credits - Maaz Anwar Nomani
Semantic Role Labeler (SRL) is a semantic parser which can automatically identify and then classify the arguments of a verb in a natural language sentence, for Hindi and Urdu. For example, in the sentence "Sara won the competition because of her hard work.", 'won' is the main verb and there are three arguments of this verb: 'Sara' (Agent), 'hard work' (Reason) and 'competition' (Theme). The problem statement of an SRL revolves around how to make a machine identify and then classify the arguments of a verb in a natural language sentence.
Since there are two sub-problems here (identification and classification), our SRL has a pipeline architecture in which a binary classifier (Logistic Regression) is first trained to identify whether a word is an argument of a verb in a sentence or not (yes or no), and subsequently a multi-class classifier (SVM with a linear kernel) is trained to classify the arguments identified by the binary classifier into one of 20 classes. These 20 classes are the various notions present in a natural language sentence (e.g. Agent, Theme, Location, Time, Purpose, Reason, Cause etc.). These 'notions' are called Propbank labels or semantic labels, present in a Proposition Bank, which is a collection of hand-annotated sentences.
In essence, SRL facilitates semantic parsing, which essentially is the research investigation of identifying WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN in a natural language sentence.
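The two-stage pipeline can be sketched with rule-based stand-ins in place of the trained Logistic Regression and SVM classifiers; the stop-word set and label table are invented for illustration:

```python
def identify(word, verb):
    """Stage 1 stand-in for the binary classifier: treat content words
    other than the verb as argument candidates (toy stop-word filter)."""
    return word != verb and word.lower() not in {"the", "because", "of", "her"}

def label(word):
    """Stage 2 stand-in for the multi-class classifier: map identified
    arguments to toy semantic labels."""
    toy_labels = {"Sara": "Agent", "competition": "Theme", "work": "Reason"}
    return toy_labels.get(word, "Arg")

def srl(sentence, verb):
    """Pipeline: identify argument candidates, then label each one."""
    words = sentence.replace(".", "").split()
    return {w: label(w) for w in words if identify(w, verb)}
```

The pipeline shape is the point: stage 2 only ever sees words that stage 1 accepted, exactly as the trained SVM only labels words the Logistic Regression identifier passed through.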
USING AUTOMATED LEXICAL RESOURCES IN ARABIC SENTENCE SUBJECTIVITY (ijaia)
A common point in almost any work on sentiment analysis is the need to identify which elements of language (words) contribute to expressing subjectivity in text. A collection of these elements (sentiment words), taken regardless of context together with their polarities (positive/negative), is called a sentiment lexical resource or subjective lexicon. In this paper, we investigate a method for generating a Sentiment Arabic Lexical Semantic Database using a lexicon-based approach. We also study the prior-polarity effects of each word, using our Sentiment Arabic Lexical Semantic Database, on sentence-level subjectivity with multiple machine learning algorithms. The experiments were conducted on an MPQA corpus containing subjective and objective sentences in Arabic, and we were able to achieve 76.1% classification accuracy.
Using automated lexical resources in Arabic sentence subjectivity (ijaia)
This document describes research on developing an Arabic sentiment lexicon using an existing Arabic lexical semantic database called RDI. The researchers generated a sentiment lexicon called SentiRDI by determining the subjectivity and polarity of words and semantic fields in the RDI database. They used a seed list of sentiment words and semantic relations in RDI, like synonyms and antonyms, to propagate sentiment scores. The system achieved 76.1% accuracy in classifying sentences as subjective or objective using multiple machine learning algorithms and an Arabic language corpus. Key challenges in building Arabic sentiment resources included the lack of semantic databases and the morphological complexity of the Arabic language.
An implementation of Apertium-based Assamese morphological analyzer (ijnlc)
Morphological analysis is an important branch of linguistics for any natural language processing technology. Morphology studies the word structure and word formation of a language. In the current scenario of NLP research, morphological analysis techniques have become more popular day by day. For processing any language, the morphology of the word should first be analyzed. The Assamese language has a very complex morphological structure. In our work we have used Apertium-based finite state transducers to develop a morphological analyzer for the Assamese language in a limited domain, and we achieved 72.7% accuracy.
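A toy suffix-stripping analyzer in the spirit of an FST lexicon might look like this; the romanized roots and suffixes are hypothetical stand-ins, not actual Apertium dictionary data:

```python
# Hypothetical root list and suffix-to-feature table.
ROOTS = {"kitap", "manuh"}
SUFFIXES = {"bor": "PL", "khon": "CLF", "": ""}  # "" = bare root

def analyze(word):
    """Return (root, feature) pairs for every valid segmentation:
    strip a known suffix and check the remaining stem against the roots."""
    analyses = []
    for suffix, feature in SUFFIXES.items():
        stem = word[:len(word) - len(suffix)] if suffix else word
        if word.endswith(suffix) and stem in ROOTS:
            analyses.append((stem, feature))
    return analyses
```

A real Apertium transducer composes such lexicon and suffix paths into a single finite-state network, so analysis is a single traversal rather than a loop over suffixes.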
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMkevig
This paper proposes an improved morphological analyser for the Arabic pronominal system using the finite-state method, whose main advantages are flexibility, power and efficiency. The most important results about FSAs relate the class of languages generated by finite-state automata to certain closure properties, which makes finite-state theory a very versatile and descriptive framework. The main contribution of this work is the full analysis and representation of the morphology of all inflected pronoun forms in Arabic. We build a finite-state network for the inflectional forms of the root words, restricted to the inflections and grammatical properties that generate the dependent and independent pronoun forms of Arabic. The output shows high accuracy with all the needed linguistic features; evaluated with the f-score test, the system achieves 80% to 83%. The results also provide evidence that Arabic has strongly concatenative word formation.
Natural Language Generation from First-Order ExpressionsThomas Mathew
In this paper I discuss an approach for generating natural language from a first-order logic representation. The approach shows that a grammar definition for the natural language and a lambda-calculus-based semantic rule collection can be applied for bi-directional translation using an overgenerate-and-prune mechanism.
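The grammar-plus-lambda-rules idea can be sketched as follows, with Python lambdas standing in for lambda-calculus terms; the tiny lexicon and two-rule grammar below are hypothetical illustrations, not the paper's rule collection:

```python
# Lexical entries pair a word with a semantic function; Python lambdas
# stand in for lambda-calculus terms. Transitive verbs are curried.
LEXICON = {
    "John":   lambda: "john",
    "Mary":   lambda: "mary",
    "sleeps": lambda subj: f"sleep({subj})",
    "loves":  lambda obj: lambda subj: f"love({subj},{obj})",
}

def interpret(words):
    """Compose a first-order formula for a toy S -> NP V (NP) grammar."""
    if len(words) == 2:                      # intransitive: NP V
        subj, verb = words
        return LEXICON[verb](LEXICON[subj]())
    if len(words) == 3:                      # transitive: NP V NP
        subj, verb, obj = words
        return LEXICON[verb](LEXICON[obj]())(LEXICON[subj]())
    raise ValueError("unsupported sentence shape")

print(interpret(["John", "sleeps"]))         # sleep(john)
print(interpret(["John", "loves", "Mary"]))  # love(john,mary)
```

Running the same rules "backwards" (enumerating sentences the grammar generates and pruning those whose composed formula does not match the input) is the overgenerate-and-prune direction the abstract refers to.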
Exploiting rules for resolving ambiguity in marathi language texteSAT Journals
Abstract
Natural language ambiguity is a situation in which some words have multiple meanings or senses. This paper discusses natural language ambiguity and its types, and proposes a knowledge-based solution to resolve the various types of ambiguity occurring in Marathi text. The task of resolving the semantic and lexical ambiguity of words to obtain their actual sense is referred to as Word Sense Disambiguation (WSD). Marathi is the official and commonly spoken language of the Indian state of Maharashtra. Plenty of words in Marathi are spelled and pronounced the same but are semantically (meaning- or sense-wise) different; during automatic translation, these words lead to ambiguity. Our method removes the ambiguity by identifying the correct sense of the given text from the predefined possible senses available in Marathi WordNet, using word and sentence rules. The method is applicable only to word-level ambiguity; structural ambiguity is not handled by this system. The system may be used as a subsystem in other Natural Language Processing (NLP) applications.
Key Words: Word Sense Disambiguation, Natural Language Processing, Marathi, Marathi WordNet, ambiguity, knowledge-based
Abstract
Part-of-speech tagging plays an important role in developing natural language processing software: it means assigning a part-of-speech tag to each word of a sentence. A part-of-speech tagger takes a sentence as input and assigns the appropriate tag to each of its words. In this article I survey the different works that have been done on Odia POS tagging.
________________________________________________
Author Credits - Maaz Nomani
A Proposition Bank is a collection of sentences which are hand-annotated with semantic label information. Currently, around 10,000 sentences containing 0.2 million words have been hand-annotated with semantic labels.
This is a natural language resource of very rich linguistic information which can be used in a variety of NLP applications such as semantic parsing, syntactic parsing, sentiment analysis, dialogue systems etc.
In this paper, we present one such resource for Urdu, a resource-poor Indian language. The Urdu Proposition Bank is built on the existing Urdu Treebank (a treebank is a corpus of sentences annotated with POS, morphological, head, TAM and dependency labels). A PropBank adds a layer of semantic information over this treebank and hence can facilitate semantic parsing and other semantic-level operations on natural language sentences.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method to develop neural morphological analyzers for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language has no delimiters between words. Hiragana is a type of Japanese phonogramic character used in texts for children or for people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because less information is available for word division. For the morphological analysis of Hiragana sentences, we demonstrate the effectiveness of fine-tuning a model based on ordinary Japanese text and examine the influence of training data drawn from texts of various genres.
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
In this paper, phoneme sequences are used as language information to perform code-switched language identification (LID). With the one-pass recognition system, the spoken sounds are converted into phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity among our target languages, we report two methods of phoneme mapping. Statistical phoneme-based bigram language models (LMs) are integrated into speech decoding to eliminate possible phone mismatches. A supervised support vector machine (SVM) is used to learn to recognize the phonetic information of mixed-language speech based on recognized phone sequences. As the back-end decision is taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to classify language identity. The speech corpus was tested on Sepedi and English, languages that are often mixed. Our system is evaluated by measuring the ASR performance and the LID performance separately. The systems obtained a promising ASR accuracy with the data-driven phone-merging approach modelled using 16 Gaussian mixtures per state, and achieved acceptable ASR and LID accuracy on both code-switched and monolingual speech segments.
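The phone-bigram scoring component can be illustrated with a minimal sketch. The phone sequences below are invented stand-ins, and the paper's HMM decoding and SVM back-end are not reproduced; here each language model simply scores a recognized phone sequence, and the higher likelihood wins:

```python
import math
from collections import defaultdict

def train_bigram_lm(sequences, alpha=1.0):
    """Build an add-alpha smoothed phone-bigram model; return a log-prob scorer."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    V = len(vocab)

    def logprob(seq):
        padded = ["<s>"] + seq + ["</s>"]
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            ctx = counts[a]
            total += math.log((ctx[b] + alpha) / (sum(ctx.values()) + alpha * V))
        return total
    return logprob

# Hypothetical toy phone sequences standing in for Sepedi and English training data.
sepedi_lm = train_bigram_lm([["t", "a", "b", "a"], ["b", "a", "t", "a"]])
english_lm = train_bigram_lm([["th", "e", "k", "ae", "t"], ["k", "ae", "t"]])

def identify(phones):
    """Assign the segment to the language whose phone LM scores it higher."""
    return "sepedi" if sepedi_lm(phones) > english_lm(phones) else "english"

print(identify(["b", "a", "t", "a"]))   # sepedi
print(identify(["k", "ae", "t"]))       # english
```

In the actual system the per-segment likelihoods feed an SVM rather than a hard argmax, which lets the classifier weigh evidence across segments of code-switched utterances.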
Word sense disambiguation using wsd specific wordnet of polysemy wordsijnlc
This paper presents a new model of WordNet that is used to disambiguate the correct sense of a polysemous word based on clue words. The words related to each sense of a polysemous word, as well as to each single-sense word, are referred to as clue words. The conventional WordNet organizes nouns, verbs, adjectives and adverbs into sets of synonyms called synsets, each expressing a different concept. In contrast, our new model of WordNet organizes the different senses of polysemous words, as well as single-sense words, based on these clue words. The clue words for each sense are then used to disambiguate the correct meaning of a polysemous word in a given context using knowledge-based Word Sense Disambiguation (WSD) algorithms. A clue word can be a noun, verb, adjective or adverb.
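The clue-word scheme can be sketched with a toy example; the word "bank", its two senses and their clue sets below are invented illustrations, not entries from the paper's WordNet:

```python
# Each sense of a polysemous word carries a set of clue words; the sense
# whose clues overlap most with the sentence context wins.
CLUE_WORDS = {
    "bank": {
        "finance": {"money", "deposit", "loan", "account"},
        "river":   {"water", "shore", "fish", "flood"},
    }
}

def disambiguate(word, context_words):
    """Return the sense with the largest clue-word overlap, or None if no clue matches."""
    senses = CLUE_WORDS.get(word)
    if not senses:
        return None
    context = set(context_words)
    best = max(senses, key=lambda s: len(senses[s] & context))
    return best if senses[best] & context else None

print(disambiguate("bank", ["he", "took", "a", "loan"]))        # finance
print(disambiguate("bank", ["the", "water", "was", "high"]))    # river
```

This is essentially a Lesk-style overlap measure; the paper's contribution is organizing WordNet itself around such clue sets rather than computing overlaps against gloss text.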
Cross lingual similarity discrimination with translation characteristicsijaia
This document summarizes a research paper on cross-lingual similarity discrimination using translation characteristics. The paper proposes a discriminative model trained on bilingual corpora to classify sentences in a target language as similar or dissimilar to a given sentence in a source language. Features used in the model include translation characteristics like sentence length ratios, word alignments, and polarity. The model is trained on various sampling methods to address the imbalanced data of having many more negative samples than positive translations. Experiments on 1500 English-Chinese sentence pairs show the model achieves satisfactory performance according to three evaluation metrics, outperforming a baseline system.
Robust extended tokenization framework for romanian by semantic parallel text...ijnlc
Tokenization is considered a solved problem when reduced to mere word-border identification and punctuation and white-space handling, yet obtaining a high-quality outcome from this process is essential for subsequent NLP pipeline processes (POS tagging, WSD). In this paper we claim that to obtain this quality we need to use all necessary linguistic, morphosyntactic and semantic-level word-related information in the tokenization disambiguation process. We also claim that semantic disambiguation performs much better in a bilingual context than in a monolingual one. We then show that, for disambiguation purposes, the bilingual text provided by high-profile on-line machine translation services performs almost at the same level as human-originated parallel texts (the gold standard). Finally, we claim that the tokenization algorithm incorporated in TORO can be used as a criterion for the comparative quality assessment of on-line machine translation services, and we provide a setup for this purpose.
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET Journal
The document describes a study that uses GloVe word embeddings to measure semantic similarity between short texts. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The study trains GloVe word embeddings on a large corpus, then uses the embeddings to encode short texts and calculate their semantic similarity, comparing the accuracy to methods that use Word2Vec embeddings. It aims to show that GloVe embeddings may provide better performance for short text semantic similarity tasks.
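The averaged-embedding approach can be sketched as follows. The three-dimensional vectors below are toy stand-ins; a real system would load pretrained GloVe vectors (e.g. a glove.6B file) instead:

```python
import math

# Toy word-vector table standing in for pretrained GloVe embeddings.
VECTORS = {
    "cat":  [0.9, 0.1, 0.0],
    "dog":  [0.8, 0.2, 0.1],
    "car":  [0.0, 0.9, 0.4],
    "auto": [0.1, 0.8, 0.5],
}

def encode(text):
    """Encode a short text as the average of its in-vocabulary word vectors."""
    vecs = [VECTORS[w] for w in text.lower().split() if w in VECTORS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Texts about animals should score closer to each other than to texts about vehicles.
print(cosine(encode("cat"), encode("dog")) > cosine(encode("cat"), encode("car")))  # True
```

The same encode-then-cosine pipeline works unchanged with Word2Vec vectors, which is what makes the two embedding families directly comparable in such a study.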
This document discusses the basic tasks involved in natural language processing (NLP). It describes the different phases of NLP including phonetics, lexical analysis, syntactic analysis, semantic analysis, discourse analysis, and pragmatic analysis. It then explains some basic NLP activities like tokenization, sentence splitting, and part-of-speech tagging. The goal of NLP is to enable computers to understand and process human languages through computational modeling.
Hybrid part of-speech tagger for non-vocalized arabic textijnlc
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
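The statistical (HMM) half of such a hybrid tagger can be sketched with a tiny Viterbi decoder. All probabilities below are made-up toy values for a two-tag, two-word vocabulary; in the hybrid setup a rule-based pass would pre-tag unambiguous words before this step:

```python
import math

# Toy HMM: two tags, and two words that are ambiguous between them.
STATES = ["NOUN", "VERB"]
START = {"NOUN": 0.6, "VERB": 0.4}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT = {"NOUN": {"fish": 0.6, "can": 0.4},
        "VERB": {"fish": 0.3, "can": 0.7}}

def viterbi(words):
    """Return the most likely tag sequence under the toy HMM (log-space)."""
    lattice = [{s: (math.log(START[s] * EMIT[s][words[0]]), [s]) for s in STATES}]
    for w in words[1:]:
        col = {}
        for s in STATES:
            # Best predecessor for state s, by accumulated score + transition.
            p = max(STATES, key=lambda q: lattice[-1][q][0] + math.log(TRANS[q][s]))
            score = lattice[-1][p][0] + math.log(TRANS[p][s] * EMIT[s][w])
            col[s] = (score, lattice[-1][p][1] + [s])
        lattice.append(col)
    return max(lattice[-1].values(), key=lambda t: t[0])[1]

print(viterbi(["fish", "can", "fish"]))   # ['NOUN', 'VERB', 'NOUN']
```

The hybrid idea is that words the rules tag with certainty constrain the lattice (their column collapses to one state), leaving Viterbi to decide only the genuinely ambiguous positions.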
Corpus-Based Vocabulary Learning in Technical EnglishCSCJournals
One of the main challenges facing English for Specific Purposes (ESP) teachers when developing syllabi at the higher-education level is the choice of vocabulary to be taught. This issue is particularly prominent in technical English, which, apart from being abundant in nouns, requires students to learn other highly frequent noun-based structures such as multi-word lexical units (MWLUs). Learning how to cope with these condensed structures in both reading and writing will make students competent and self-confident ESP users. The choice of lexical items is usually left to teachers' intuition. This paper intends to help teachers avoid making such intuitive decisions, offering instead a model for incorporating corpus-based vocabulary findings into their ESP syllabi. Thus, the research questions addressed in this paper are: Can computer software extract all the nouns, MWLUs and multi-noun lexical units (MNLUs) with 100% certainty? What is their precise number in the pedagogical corpus of English for Traffic and Transport Purposes (ETTP)? The paper approaches the issue from an analytical point of view, i.e. by instructing the teacher-researcher step by step in analysing pedagogical specialized corpora, including the possible problems they might encounter using irreplaceable, yet not completely accurate, computer software for the purpose. The paper proposes original solutions to overcome the encountered imperfections in order to obtain accurate and evidence-based lists of the most frequent nouns, MWLUs and MNLUs, making some extra effort by manually complementing the computer-based analysis. By applying the methodology proposed in this paper, teacher-researchers no longer have to wonder which nouns, MWLUs and MNLUs to teach, since they can create accurate lists by themselves.
Furthermore, a specialized pedagogical corpus analysis provides a valuable basis for creating glossaries and specialized minimum dictionaries, serving as a source for creating syllabi and lexis oriented exercises, as well as for designing language tests – all of it with the ultimate scope of improving students’ lexical competencies in a specific field of study.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts mixing English words with phonetically typed Bangla words have become increasingly common, and existing transliteration tools perform poorly on them. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetically typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
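The dictionary-then-rules idea can be sketched as follows; the dictionary entries and syllable rules below are tiny hypothetical stand-ins, not THT's actual resources or staging:

```python
# Stage 1: exact dictionary lookup for known phonetically typed words.
DICT = {"ami": "আমি", "bhalo": "ভালো"}
# Stage 2 fallback: greedy longest-match syllable rules (toy fragment).
RULES = {"tu": "তু", "mi": "মি", "ka": "কা", "ta": "তা"}

def transliterate(word):
    """Dictionary lookup first; fall back to greedy longest-match rules."""
    if word in DICT:
        return DICT[word]
    out, i = "", 0
    while i < len(word):
        for n in (3, 2, 1):                  # try the longest rule first
            chunk = word[i:i + n]
            if len(chunk) == n and chunk in RULES:
                out += chunk and RULES[chunk]
                i += n
                break
        else:
            out += word[i]                   # a third stage would handle unknowns
            i += 1
    return out

print(transliterate("ami"))    # আমি (dictionary hit)
print(transliterate("tumi"))   # তুমি (rule fallback)
```

The dictionary stage guarantees correct output for frequent words with irregular spellings, while the rule stage keeps coverage open-ended for unseen phonetic typing.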
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMA...kevig
This paper refers to the syntactic analysis of phrases in Romanian, an important process of natural language processing. We suggest a real-time solution based on the idea of using words or groups of words that indicate a grammatical category, and specific endings of some parts of a sentence. Our idea builds on characteristics of the Romanian language, where certain prepositions, adverbs or specific endings can provide a great deal of information about the structure of a complex sentence; such characteristics can be found in other languages too, such as French. Using a special grammar, we developed a system (DIASEXP) that can perform a dialogue in natural language, with assertive and interrogative sentences, about a “story” (a set of sentences describing some events from real life).
Author Credits - Maaz Anwar Nomani
Semantic Role Labeler (SRL) is a semantic parser which can automatically identify and then classify the arguments of a verb in a natural language sentence, for Hindi and Urdu. For example, in the sentence “Sara won the competition because of her hard work.”, ‘won’ is the main verb and there are three arguments for this verb: ‘Sara’ (Agent), ‘hard work’ (Reason) and ‘competition’ (Theme). The problem statement of an SRL revolves around how to make a machine identify and then classify the arguments of a verb in a natural language sentence.
Since there are two sub-problems here (identification and classification), our SRL has a pipeline architecture in which a binary classifier (logistic regression) is first trained to identify whether a word is an argument of a verb in a sentence (yes or no), and subsequently a multi-class classifier (SVM with a linear kernel) is trained to classify the arguments identified by the binary classifier into one of 20 classes. These 20 classes are the various notions present in a natural language sentence (e.g. Agent, Theme, Location, Time, Purpose, Reason, Cause). These ‘notions’ are called PropBank labels, or semantic labels, present in a Proposition Bank, which is a collection of hand-annotated sentences.
In essence, SRL facilitates semantic parsing, which essentially is the investigation of identifying WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN in a natural language sentence.
USING AUTOMATED LEXICAL RESOURCES IN ARABIC SENTENCE SUBJECTIVITYijaia
A common point in almost any work on sentiment analysis is the need to identify which elements of language (words) contribute to expressing subjectivity in text. Collecting these elements (sentiment words), regardless of context, together with their polarities (positive/negative) yields sentiment lexical resources, or subjective lexicons. In this paper, we investigate a method for generating a Sentiment Arabic Lexical Semantic Database using a lexicon-based approach. We also study the prior-polarity effects of each word in our Sentiment Arabic Lexical Semantic Database on sentence-level subjectivity under multiple machine learning algorithms. The experiments were conducted on an MPQA corpus containing subjective and objective Arabic sentences, and we were able to achieve 76.1% classification accuracy.
This document summarizes current research on morphological analysis techniques for the Assamese language. It discusses prior work using rule-based and unsupervised methods for morphological analysis of several Indian languages, including Hindi, Bengali, Punjabi, Marathi, Tamil, Malayalam, Kannada, and Assamese. For Assamese specifically, it describes several studies that used suffix stripping and rule-based approaches to develop morphological analyzers, as well as some initial work on unsupervised techniques. The document concludes that while most existing work on Assamese has used supervised suffix stripping methods, unsupervised techniques show promise but have not been fully explored.
The document provides background information on derivational processes in the Matbat language. It begins by discussing how language is used for communication between humans and as an identifier of one's cultural background. It then notes that morphology, the study of how words are formed, is divided into inflection and derivation. The author examines the derivational process in the Matbat language by looking at changes in word class, such as from adjective to verb or noun. Some examples of derivational suffixes changing nouns in Matbat are provided. The purpose of the research is to investigate this derivational process in the language, and the results are expected to provide useful information for language learners and teachers. The scope is limited to changes from adjective to verb, adjective to noun, and verb to noun.
Derivational process in matbat language (jurnal ijhan)Trijan Faam
This document discusses a study on derivational processes in the Matbat language. It begins with an abstract that outlines the research aims to investigate derivational processes, specifically looking at changes from adjective to verb, adjective to noun, and verb to noun. The introduction provides background on morphology and derivational processes, and states the research question is to examine these processes in Matbat. It reviews related literature on morphemic shifts and word formation processes. The scope is limited to the specified changes between word classes.
This document discusses key concepts in morphology including:
1) Morphemes are the minimal units of meaning or grammatical function that can be free or bound. Free morphemes can stand alone as words while bound morphemes are attached to other forms.
2) Morphology analyzes the structure of words including prefixes, suffixes, and parts of words. It also distinguishes between lexical/functional and derivational/inflectional morphemes.
3) Morphological fusion refers to the degree to which an affix is phonologically joined to a stem, which correlates with how semantically relevant the affix is to the stem. Allomorphy describes phonological alternations in morphemes.
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESLinda Garcia
This document summarizes several existing grammar checkers for various natural languages. It discusses rule-based, statistical, and hybrid approaches to grammar checking. Grammar checkers described include those for Afan Oromo, Amharic, Swedish, Icelandic, Nepali, and Portuguese. The document analyzes the approaches, methodologies, advantages, and limitations of each grammar checker.
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGEScsandit
This document summarizes and reviews various grammar checkers for natural languages. It begins by defining key concepts in natural language processing like computational linguistics and grammar checking. It then describes the general working of grammar checkers, which involves preprocessing text, analyzing morphology and syntax, and identifying grammatical errors. The document surveys grammar checking approaches for several languages like rule-based, statistical, and hybrid methods. Specific grammar checkers are discussed for languages like Afan Oromo, Amharic, Swedish, Icelandic, Nepali, and Portuguese. The review concludes by analyzing the features and limitations of existing grammar checking systems.
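The rule-based approach these surveys describe can be illustrated with a toy checker; the three English rules below are illustrative inventions, not patterns taken from any surveyed system, and real checkers operate over tagged tokens rather than raw regexes:

```python
import re

# Each rule is a pattern over the text plus a diagnostic message.
RULES = [
    (re.compile(r"\ba ([aeiou]\w*)", re.I), "use 'an' before a vowel sound"),
    (re.compile(r"\b(\w+) \1\b", re.I), "repeated word"),
    (re.compile(r"\bdoes not (\w+s)\b"), "verb after 'does not' should be base form"),
]

def check(sentence):
    """Return (message, offending text) pairs for every rule that fires."""
    issues = []
    for pattern, message in RULES:
        for m in pattern.finditer(sentence):
            issues.append((message, m.group(0)))
    return issues

for msg, span in check("She does not likes a apple apple."):
    print(f"{msg}: '{span}'")
```

Surface patterns like the a/an rule over-flag cases such as "a university" (a consonant sound), which is exactly the kind of limitation that pushes real systems toward morphological and syntactic analysis before rule matching.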
This document describes a proposed method for introducing a new semantic relation between adjectives and nouns in WordNets. The relation is intended to capture attributes of nouns expressed by frequently occurring adjectives in simile constructions (e.g. "as busy as a bee" would link "busy" to "bee"). The method involves extracting simile constructions from an annotated corpus and identifying the most frequent adjective-noun pairs. These pairs would then be added as a new "specific attribute" relation to the Serbian WordNet to help with sentiment analysis and detecting figurative language. An evaluation found 84% of automatically identified pairs were also selected by human judges.
The document discusses comparing the native Kadazan language of Penampang with English in terms of morphology, syntax and semantics. It provides two examples of metaphors that exist in the Kadazan language and how English expresses similar metaphors. The formation of compound sentences is analyzed for both languages, noting they are similar in using coordinators but different in word order. Morphological systems are compared, finding similarities in meaning change with inflectional and derivational morphemes but differences in word length and use of prefixes versus suffixes.
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots can be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model: it captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It achieved an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
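The PMI machinery can be sketched over toy co-occurrence counts. The word pairs below are invented English stand-ins, and the context-distribution smoothing shown (raising counts to a power alpha) is one common variant; the paper's exact smoothing and its mapping from SPMI scores to root choices are not reproduced here:

```python
import math
from collections import Counter

# Toy (word, context) co-occurrence observations.
pairs = [("write", "book"), ("write", "book"), ("write", "pen"),
         ("eat", "bread"), ("eat", "book")]

pair_counts = Counter(pairs)
word_counts = Counter(w for p in pairs for w in p)
N = len(pairs)

def spmi(w, c, alpha=0.75):
    """PMI with the context distribution smoothed by exponent alpha."""
    p_wc = pair_counts[(w, c)] / N
    if p_wc == 0:
        return float("-inf")
    p_w = word_counts[w] / N
    # Smoothed context probability: count^alpha / sum of counts^alpha.
    denom = sum(word_counts[x] ** alpha for x in word_counts)
    p_c = word_counts[c] ** alpha / denom
    return math.log2(p_wc / (p_w * p_c))

print(spmi("write", "book") > spmi("eat", "book"))   # True
```

For stemming disambiguation, such scores let the system prefer the candidate root whose typical contexts best match the contexts of the ambiguous surface word.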
Deterministic Finite State Automaton of Arabic Verb System: A Morphological S...CSCJournals
Finite-state morphology is an important tool for natural language processing research, and morphological analysis is an essential preprocessing step in NLP. This paper discusses the morphological analysis and processing of verb forms in Arabic, focusing on the inflected verb forms: the perfective, the imperfective and the imperative. The deterministic finite-state morphological parser for verb forms handles the morphological and orthographic features of Arabic and the morphological processes involved in Arabic verb formation and conjugation. We use this model to generate and attach all the necessary information (prefix, suffix, stem, etc.) to each morpheme of a word, so we need subtags for each morpheme. We use a finite-state tool to build the computational lexicon, which is structured as a list of the stems and affixes of the language together with a representation of how words can be structured and how the network of all forms can be represented.
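The prefix-stem-suffix traversal such a parser performs can be sketched in transliteration. The forms yaktubu ("he writes") and kataba ("he wrote") are real Arabic verb forms of the root k-t-b, but the affix and stem tables below are a tiny illustrative fragment, not the paper's lexicon or tag set:

```python
# Toy affix network: prefix -> stem lexicon -> suffix, with morpheme subtags.
PREFIXES = {"ya": "IMPERF.3SG.M", "ta": "IMPERF.3SG.F", "": "PERF"}
SUFFIXES = {"u": "INDIC", "a": "3SG.M", "": ""}
STEMS = {"ktub": "k-t-b 'write'", "katab": "k-t-b 'write'", "drus": "d-r-s 'study'"}

def analyze(word):
    """Enumerate prefix/suffix splits whose remainder is a lexicon stem."""
    analyses = []
    for pre, pre_tag in PREFIXES.items():
        for suf, suf_tag in SUFFIXES.items():
            if word.startswith(pre) and word.endswith(suf):
                stem = word[len(pre):len(word) - len(suf)]
                if stem in STEMS:
                    analyses.append((pre_tag, STEMS[stem], suf_tag))
    return analyses

print(analyze("yaktubu"))   # [('IMPERF.3SG.M', "k-t-b 'write'", 'INDIC')]
print(analyze("kataba"))    # [('PERF', "k-t-b 'write'", '3SG.M')]
```

A real finite-state analyzer compiles the same information into a transducer so that analysis and generation run in a single deterministic pass rather than by enumerating splits.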
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...gerogepatton
Many automatic translation works have addressed major European language pairs by taking advantage of large-scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair because of its parallel-data scarcity: no benchmark parallel Amharic-Arabic text corpus is available for the machine translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text corpus and its equivalent Amharic translation, available on Tanzile. Experiments are carried out on two neural machine translation (NMT) models, based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), using the attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM- and GRU-based NMT models and the Google Translate system are compared, and the LSTM-based OpenNMT is found to outperform the GRU-based OpenNMT and Google Translate, with BLEU scores of 12%, 11%, and 6%, respectively.
This document discusses models of communicative language ability and the role of grammar within those models. It summarizes Lado's skills-and-elements model which viewed language as separate skills of phonology, structure, and lexicon. Later models recognized grammar's relationship to meaning and context. Canale and Swain's model defined grammatical competence as rules of form and meaning, but did not distinguish their relationship. Bachman and Palmer's comprehensive model views language ability as consisting of organizational knowledge, including grammatical and textual knowledge, and pragmatic knowledge, including functional and sociolinguistic knowledge. It defines the components of grammar and their relationship to meaning and language use.
The document discusses how readers use different approaches and background knowledge to interpret texts. Some students may use letter-sound knowledge to decode unknown words, while others rely more on semantic knowledge and meaning. Both methods are valid ways for individuals to understand texts, as people process information in various ways. Effective reading involves using multiple sources of information, including visual, phonetic, and syntactic cues, without a fixed order.
Similar to Addlaall search-engine--hattab-haddad-yaseen-uop (20)
Addaall Arabic Search Engine: Improving Search based on Combination of Morphological Analysis and Generation Considering Semantic Patterns

Mamoun Hattab*, Bassam Haddad†, Mustafa Yaseen‡, Asem Duraidi* and A. Abu Shmais*
*Arabic Textware Inc., Amman-Jordan
{m.hattab, a.duraidi, a.abushmais}@arabtext.ws
†University of Petra, Department of Computer Science, Amman-Jordan
haddad@uop.edu.jo
‡Amman University, Department of Computer Science, Amman-Jordan
myaseen@ammanu.edu.jo
Abstract
This paper addresses issues involved in utilizing different levels of Arabic morphological knowledge to improve the search process in a search engine. The different levels of morphological knowledge can be treated as an incremental process that considers the next most closely related word pattern, with the aim of achieving both higher recall and higher precision.
1. Introduction

This paper attempts to demonstrate how the utilization of different levels of Arabic morphological knowledge can improve the search process in a search engine. We show how combining morphological analysis and generation can be used for retrieving information based on a semi-semantic linguistic search in Arabic. This technique extends the usage of roots and stems into a method of categorizing morphological patterns into semantic groups, and uses a morphological generator to produce the different inflections of a word to be retrieved, based on their semantic distance from that word.

The Google approach is primarily based on matching the user's keywords against the texts under which they are indexed in terms of words (Brin and Page, 1998). Words are treated as sets of symbols, not as words with meanings to human users. And while plain linguistic measures, such as stemming and structured data search, are used to enhance results, Google and similar search engines are by nature still "Symbolic Computing" machines.

In the third generation of search engines, Natural Language Processing technologies are applied extensively, because the search is seen in the first place as a language understanding process. This approach to search is paradigmatically different and a level higher in terms of system difficulty and complexity, and it offers more accurate and consistent search results than second-generation search engines through its intelligence in language understanding. In this context, many researchers have considered this aspect (Abuleil and Alsamra, 2004; Chen and Gey, 2002; Aljlayl, 2002).

For a language like Arabic, where the structure of a word can change according to many factors while maintaining the same meaning, "Symbolic Computing" does not retrieve an accurate set of results, and there is an immense need for a linguistic approach to handle the search process.

The Addaall(1) Arabic Search Engine built by Arabic Textware Inc.(2) utilizes a morphological analyzer and generator to construct different indices based on both the root and the stem of a word. While retrieving information based on the root overwhelms the user with a complete but less relevant set of results, stem-based search not only retrieves fewer results but also yields a less accurate result set.

(1) http://www.addaall.com
(2) http://www.arabtext.ws/

2. Morphological Knowledge implies Syntactic and Semantic Indicators

Morphological analysis produces much information that is useful in the subsequent processing steps, such as syntactic and semantic analysis. This information includes:

The morphological type of a word
By "morphological type" we mean determining whether the word is a noun, verb, or particle, and which subtype of noun, verb, or particle it is. The syntactic functionality of a word is clearly highly dependent on its morphological type. For example, the syntactic function of a "subject" requires that the morphological type be a noun. The morphological type also provides the semantic analysis with the semantic properties of each word. For example, determining that (فعّال /fa''āl/) is an exaggeration form tells us that the meaning is much emphasized.

The Determiner
By "determiner" we mean any indeclinable noun, verb, determiner, or particle, in addition to the affixes. Clearly, the variety of syntactic functions that determiners can perform is wide. The preposition (في /fī/, in) indicates that the word that follows is a noun in the genitive case (Haddad, 2007). Even a letter such as (و /wāw/) in (مرسلو), where it functions as a pronoun, tells us that the whole word is a sound masculine plural, which affects the whole sentence.

One example of how determiners benefit semantic analysis is the prefix (ال /l/)(3), which indicates that the following word is a definite noun. Furthermore, the indefinite and dual states, which can be extracted from the morphology, can also be considered semantically as a source of semantic knowledge, such as unique quantification. Another example is the prefix (س /s/), which shifts the meaning of a present-tense verb to the future.

The Morphological Pattern of a word
It is true, even if not obvious to everyone, that the patterns of Arabic words carry syntactic and semantic content. The pattern (فعُل) of the past verb tells us that the verb is intransitive, which means that the sentence is complete with no need for an object. An example of the semantic benefit is the meaning of request conveyed by the pattern (استفعل) and its derivatives.

The root
All roots have a semantic aspect, because the root in Arabic is the part that carries the meaning; it is the core of the language. Moreover, some roots have syntactic properties, such as (حسب), which tells us that the verb needs two objects.

(3) Please note that deep semantic analysis considering the logical compositionality of determiners such as (ال /l/) is out of the scope of this paper. Determiners in the logical sense (Haddad, 2007) represent quantifiers and a special case of the Generalized Arabic Quantifier (GAQ), such as:

[ (معظم-ال /most-of-the/), CAT DET_Arab ]_sem ≡ λR.λS.( most-of-the(x) )(R, S), where |R ∩ S| > |R − S|,

whereas in this context (ال /l/) can be regarded as a numeric quantifier:

[ (ال /The/), CAT DET, AG [NUM sing] ]_sem as [ (ال₁ /The₁/) ]_sem, where

[ (ال₁ /The₁/) ]_sem ≡ λP.λQ.∃x.(∀y (P(y) ⇔ x = y) ∧ Q(x))

2.1 Morphological Levels for Search

To measure the closeness between the meaning of the word being searched for and the suggested alternative words, we adopted the concept of utilizing different levels of morphological knowledge. The lowest level expresses the strongest relationship between the original word and its alternatives.

At the Zero Level, only the identical word itself is considered in the search process.

Level One:
A. For verbs: verb inflections referring to the same tense (e.g. ضربَ /ضربوا /ضربتم).
B. For derivative nouns: their grammatical states (قائلٌ، قائلاً، قائلٍ، القائلُ، القائلَ، القائلِ).
C. For the gerund: its grammatical states (قولٌ، قولاً، قولٍ).

Level Two:
A. For verbs: verb inflections referring to all tenses (e.g. ضربَ /يضرب /اضرب /تضربان...).
B. For derivative nouns: their grammatical and morphological states (قائلٌ /قائلةٌ /قائلان /قائلين، قائلاتٌ، القائلون، القائلاتُ، القائلاتِ، قائلاتٍ).
C. For the gerund: its grammatical and morphological states (قولٌ، قولاً، قولٍ، القولُ، القولَ، القولِ).

Level Three:
A. Connecting the verb only to its gerund and vice versa (e.g. ضرَبَ /ضرْب /الضربِ).
B. Connecting the derivative noun to its corresponding morphological counterparts that share the same morphological category, for example connecting exaggeration patterns together, such as (مِقوال /قوّالٌ...).

Level Four:
A. Connecting the verb to its gerund and its derivative nouns and vice versa (e.g. استقال /مستقيل /مستقال).
B. Connecting the gerunds and the derivative nouns to each other (مستفيد /مستفاد /استفادة).
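The level scheme above can be sketched in code. The following is a minimal illustration only, not the Addaall generator itself: the table of alternative forms is hypothetical and hand-built, and Latin transliterations stand in for the Arabic forms of the (ضربَ, hits) example.

```python
# Minimal sketch of level-based query expansion (illustrative only).
# The table below is a hypothetical stand-in for a real morphological
# generator; forms are Latin transliterations of derivatives of the
# root d-r-b ("to hit").
LEVEL_ALTERNATIVES = {
    0: {"daraba"},                          # Zero Level: the exact word only
    1: {"darabuu", "darabtum", "darabaa"},  # Level One: same-tense inflections
    2: {"yadribu", "idrib", "tadribaan"},   # Level Two: all tenses
    3: {"darb"},                            # Level Three: verb <-> gerund
    4: {"daarib", "madruub"},               # Level Four: + derivative nouns
    5: {"darraab", "midraab"},              # Root Level: all forms of the root
}

def expand(word: str, level: int) -> set[str]:
    """Cumulative candidate set for a query word.

    Each level widens the previous one, trading precision for recall:
    level 0 is an exact match, level 5 approximates the root-level search.
    """
    candidates = {word}
    for l in range(level + 1):
        candidates |= LEVEL_ALTERNATIVES.get(l, set())
    return candidates
```

A search front-end would call `expand` with a small level first and only move to a higher level when too few results come back, which is exactly the recall/precision trade-off the level scheme encodes.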
And finally there is the Root Level, representing the widest possible search in the context of semantic dependency.

3. Enhancing Search based on Different Levels of Morphological Knowledge

To enhance the quality of results by minimizing their number while maintaining relevancy and precision, a hybrid approach combining morphological analysis and generation was applied. In this approach a query goes through the following processes:

- Morpho-syntactic analysis, to determine its root/stem and part of speech.
- Morphological generation, to produce the nearest inflections of that word based on a semantic categorization of the different patterns that may apply to its root/stem.

When you search based on a word's root, you are sure to get all of the relevant results, but together with irrelevant ones: although each retrieved result includes at least one inflection of the word, that inflection is not necessarily relevant.

This feature of the Arabic language cannot be controlled without a lexicon that stores semantic knowledge. In that case, all words would be defined with their semantic relationships to other words in the lexicon, extending the possibilities to cover not only inflections but also synonyms and even related conceptual and ontological knowledge.

In the case of stem search, stemming techniques for Arabic are still not well defined. The definition of a stem in Arabic is often confused with that of the root. Many researchers even argue that if the stem has a different meaning than the root, then stemming does not apply to the Arabic language; they consider it a foreign concept.

By applying stemming to an Arabic word, we mean eliminating prefixes and suffixes from that word. What is left is a partial word that in many cases has no meaning and does not exist in the dictionary as such. We then use this partial word to perform a partial search, looking for words that include this part. The outcome of such a search has fewer results than search by root. But how significant are these results in terms of precision and recall? This is why we believe that Arabic stem search is neither accurate nor comprehensive.

This led us to suggest a method to improve Arabic search techniques by using the morphological generator. To do so, we classified the morphological patterns of Arabic semantically according to their meanings. For each word we want to search for, the most morphologically and semantically related word forms are generated to become subjects of the improved search process.

An example: if the user is searching for (ضربَ, hits), then we notice the following:

At the Zero Level, (ضربَ) itself is the only suggested alternative.
At Level One, the suggested alternatives are its same-tense inflections (ضربَ /ضربوا /ضربتم /ضربا /ضربتَ, etc.).
At Level Two, the suggested candidates cover all tenses (ضربَ /يضرب /اضرب /تضربان, etc.).
At Level Three, the gerund is connected as well (ضربَ /ضرْب /الضربِ, etc.).
At Level Four, the suggested alternatives include the derivative nouns (ضربَ /ضارب /مضروب /مضروبتيْن, etc.).
At the Root Level, the suggested alternatives are all the nouns and verbs that share the root (ضرب).

4. Conclusion and Outlook

In this paper we have stressed the importance of utilizing different levels of morphological knowledge to improve the quality and quantity of the results of the Addaall Arabic Search Engine. The results are very promising, as the coverage and sensitivity of this approach make it highly practical (see the search engine at http://www.addaall.com).

Furthermore, our approach considers the concept of categorizing Arabic patterns according to their meanings, where some semi-semantic information has proven useful in the search process. In this context, we have established a matrix correlating each pattern of speech with its semantically related patterns. The different levels of morphological knowledge can be considered both as an incremental process that considers the next most closely related word pattern, i.e. to increase recall, and as a predictive, heuristic process to increase precision, since a heuristically related word is likely to occur at a near distance.

Finally, although we understand that this approach produces a relative rather than complete set of results, it nevertheless represents a significant improvement for Arabic search engines in view of its practical and pragmatic use in real implementations.

References

Abuleil S. and Alsamra K. (2004). New Technique to Support Arabic Noun Morphology: Arabic Noun Classifier System (ANCS). International Journal of Computer Processing of Oriental Languages, Vol. 17, Issue 2.

Chen A. and Gey F. (2002). Building an Arabic stemmer for information retrieval. The Eleventh Text Retrieval Conference (TREC 2002), National Institute of Standards and Technology (NIST).
Aljlayl M. (2002). Arabic Search: Improving the Retrieval Effectiveness via a Light Stemming Approach. ACM Eleventh Conference on Information and Knowledge Management.

Brin S. and Page L. (1998). The anatomy of a large hypertextual web search engine. Web publication, Stanford University.

Haddad B. (2007). Semantic Representation of Arabic: A Logical Approach towards Compositionality and Generalized Arabic Quantifiers. International Journal of Computer Processing of Oriental Languages, Vol. 20, Nr. 1.
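As an illustrative addendum, the hybrid pipeline of Section 3 (morpho-syntactic analysis, then pattern-based generation, then search) can be combined with the morphological levels into an incremental loop that widens the candidate set only when too few results come back. This is a sketch under stated assumptions: `analyze`, `generate`, and `lookup` are hypothetical stubs rather than the actual Addaall components, and the stopping criterion (a minimum number of hits) is an assumption made for illustration.

```python
# Sketch of an incrementally widened morphological search.
# The three callables model the pipeline stages of Section 3; they are
# hypothetical stand-ins, not the real Addaall analyzer/generator/index.
from typing import Callable

def search_incremental(
    query: str,
    analyze: Callable[[str], tuple[str, str]],      # word -> (root, part of speech)
    generate: Callable[[str, str, int], set[str]],  # (root, pos, level) -> word forms
    lookup: Callable[[set[str]], list[str]],        # word forms -> matching documents
    min_hits: int = 10,
    max_level: int = 5,
) -> list[str]:
    """Start at the exact word (level 0, highest precision) and widen
    toward the root level (highest recall) only while results are scarce."""
    root, pos = analyze(query)
    results: list[str] = []
    for level in range(max_level + 1):
        forms = generate(root, pos, level) | {query}
        results = lookup(forms)
        if len(results) >= min_hits:
            break  # enough results at this precision; stop widening
    return results
```

For example, with toy stubs that return one extra verb form above level 0, a request for at least two hits forces exactly one widening step before the loop stops.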