In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of that word. In this paper we consider the problem of deciding whether the same words contained in different documents refer to the same meaning or are homonyms. Our goal is to improve the estimate of the similarity of documents in which some words may be used with different meanings. We present three new strategies for solving this problem, which are used to filter out homonyms from the similarity computation. Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be applied to word sense disambiguation. The three strategies have been embedded in an article recommendation system that one of the most important Italian ad-serving companies offers to its customers.
FURTHER INVESTIGATIONS ON DEVELOPING AN ARABIC SENTIMENT LEXICON (ijnlc)
Lexical resources are hugely helpful for accelerating and simplifying sentiment analysis in English. In Arabic, there are few resources, and these resources are not comprehensive. Most current research efforts for constructing an Arabic Sentiment Lexicon (ASL) depend on a large number of lexical entities. However, coverage of all Arabic sentiment expressions can be achieved using refined regular expressions rather than a large number of lexical entities. This paper presents an ASL that is more comprehensive than the existing lexicons, covering many expressions in different dialects, including Franco-Arabic, while at the same time being more compact. The paper also shows how to integrate different lexicons and refine them. To enrich the lexical entries with robust morphological and syntactical information, regular expressions, sentiment polarity weights, and n-gram terms have been added to each entry.
This paper introduces a novel approach to tackle the existing gap in message translation in dialogue systems. Currently, messages submitted to dialogue systems are treated as isolated sentences. Thus, missing context information impedes the disambiguation of homographs in ambiguous sentences. Our approach solves this disambiguation problem by using concepts from existing ontologies.
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB... (ijnlc)
In this paper we present our work on studying the sublanguage of Arabic SMS-based classified ads. This study is presented from the developer's point of view. We use the corpus collected from an operational system, CATS. We also compare the SMS-based and the Web-based messages and discuss some quantitative properties of the studied text.
GDG Tbilisi 2017. Word Embedding Libraries Overview: Word2Vec and fastText (rudolf eremyan)
This presentation compares different word embedding models and libraries, such as word2vec and fastText, describing their differences and showing their pros and cons.
This paper discusses a new metric that has been applied to verify translation quality between sentence pairs in parallel Arabic-English corpora. This metric combines two techniques, one based on sentence length and the other based on compression code length. Experiments on sample parallel Arabic-English test corpora indicate that the combination of these two techniques improves the accuracy of identifying satisfactory and unsatisfactory sentence pairs compared to sentence length and compression code length alone. The new method proposed in this research is effective at filtering noise and reducing mis-translations, resulting in greatly improved quality.
A NEW METHOD OF TEACHING FIGURATIVE EXPRESSIONS TO IRANIAN LANGUAGE LEARNERS (cscpconf)
In teaching languages, if we only consider the direct relationship between form and meaning and leave psycholinguistics aside, the approach will not be successful, and language learners will not be able to communicate successfully in real contexts. The present study intends to answer this question: is the teaching method in which salient meaning is taught more successful than the methods in which literal or figurative meaning is taught? To answer the research question, 30 students were selected and divided into three groups of ten. Twenty figurative expressions were taught to each group. Group one was taught the figurative meaning of each expression, group two the literal meaning, and group three the salient meaning. Then the three groups were tested. After analyzing the data, we concluded that there was a significant difference in mean grades between the classes, and the class trained under the graded salience hypothesis was more successful. This suggests that traditional teaching methods should be revised.
Author credits - Maaz Nomani
This paper presents our work on the annotation of intra-chunk dependencies on an English treebank that was previously annotated with inter-chunk dependencies. Given a natural language sentence, intra-chunk dependencies show relationships between words within a chunk, while inter-chunk dependencies show relationships among the chunks. For example, in the sentence "James cut an apple with a knife", chunking results in (James) (cut) (an apple) (with a knife). Inter-chunk dependencies show dependency relations among these four chunks, and intra-chunk dependencies show dependency relations between the words within each chunk.
This exercise provides fully parsed dependency trees for the English treebank. We also report an analysis of the inter-annotator agreement for this chunk expansion task. Further, these fully expanded parallel Hindi and English treebanks were word-aligned, and an analysis of the task is given. Issues related to intra-chunk expansion and alignment for the Hindi-English language pair are discussed, and guidelines for these tasks have been prepared and released.
Intent Classifier with Facebook fastText
Facebook Developer Circle, Malang
22 February 2017
These are the slides for a Facebook Developer Circle meetup, aimed at beginners.
Word sense disambiguation using WSD-specific WordNet of polysemy words (ijnlc)
This paper presents a new model of WordNet that is used to disambiguate the correct sense of a polysemous word based on clue words. The related words for each sense of a polysemous word, as well as for a single-sense word, are referred to as the clue words. The conventional WordNet organizes nouns, verbs, adjectives, and adverbs together into sets of synonyms called synsets, each expressing a different concept. In contrast to the structure of WordNet, we developed a new model of WordNet that organizes the different senses of polysemous words, as well as the single-sense words, based on the clue words. These clue words for each sense of a polysemous word, as well as for each single-sense word, are used to disambiguate the correct meaning of the polysemous word in the given context using knowledge-based Word Sense Disambiguation (WSD) algorithms. A clue word can be a noun, verb, adjective, or adverb.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES (kevig)
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use pre-trained distributed word representations, commonly called word embeddings. Determining the best word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing their performance on capturing word similarities with existing benchmark datasets for word-pair similarities. The research in this paper conducts a correlation analysis between ground-truth word similarities and similarities obtained by different word embedding methods.
A high-quality end-to-end speech translation model relies on large-scale speech-to-text training data, which is usually scarce or even unavailable for some low-resource language pairs. To overcome this, we propose a target-side data augmentation method for speech translation in low-resource languages. In particular, we first generate large-scale target-side paraphrases based on a paraphrase generation model that incorporates several statistical machine translation (SMT) features and a commonly used recurrent neural network (RNN). Then, a filtering model consisting of semantic similarity and word-pair/speech co-occurrence is proposed to select the highest-scoring source paraphrase pairs from the candidates. Experimental results on English, Arabic, German, Latvian, Estonian, Slovenian, and Swedish paraphrase generation show that the proposed method achieves significant and consistent improvements over several strong baseline models on the PPDB datasets (http://paraphrase.org/). To introduce the paraphrase generation results into low-resource speech translation, we propose two strategies: audio-text pair recombination and multi-reference training. Experimental results show that speech translation models trained on new audio-text datasets that incorporate the paraphrase generation results achieve substantial improvements over the baselines, especially for low-resource languages.
ADVERSARIAL GRAMMATICAL ERROR GENERATION: APPLICATION TO PERSIAN LANGUAGE (kevig)
Grammatical error correction (GEC) greatly benefits from large quantities of high-quality training data.
However, the preparation of a large amount of labelled training data is time-consuming and prone to
human errors. These issues have become major obstacles in training GEC systems. Recently, the
performance of English GEC systems has drastically been enhanced by the application of deep neural
networks that generate a large amount of synthetic data from limited samples. While GEC has extensively
been studied in languages such as English and Chinese, no attempts have been made to generate synthetic
data for improving Persian GEC systems. Given the substantial grammatical and semantic differences of
the Persian language, in this paper, we propose a new deep learning framework to create large enough
synthetic sentences that are grammatically incorrect for training Persian GEC systems. A modified version
of sequence generative adversarial net with policy gradient is developed, in which the size of the model is
scaled down and the hyperparameters are tuned. The generator is trained in an adversarial framework on
a limited dataset of 8000 samples. Our proposed adversarial framework achieved bilingual evaluation
understudy (BLEU) scores of 64.5% on BLEU-2, 44.2% on BLEU-3, and 21.4% on BLEU-4, and
outperformed both the conventional long short-term memory model trained in a supervised fashion using maximum likelihood estimation and the recently proposed sequence labeler using neural machine translation augmentation. This
shows promise toward improving the performance of GEC systems by generating a large amount of
training data.
Cross-lingual similarity discrimination with translation characteristics (ijaia)
In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment. This paper proposes a discriminative model trained on a bilingual corpus to divide a set of sentences in the target language into two classes according to their similarities to a given sentence in the source language. Positive outputs of the discriminative model are then ranked according to the similarity probabilities. The translation candidates of the given sentence are finally selected from the top-n positive results. One of the problems in model building is the extremely imbalanced training data, in which positive samples are the translations of the target sentences, while negative samples, or the non-translations, are numerous or unknown. We train models on four kinds of sampling sets with the same translation characteristics and compare their performance. Experiments on the open dataset of 1,500 pairs of English-Chinese sentences are evaluated by three metrics with satisfying performance, much higher than the baseline system.
In this paper, we propose a method that aligns comparable bilingual tweets which not only takes into account the specificity of a tweet but also handles proper names, dates, and numbers in two different languages. This makes it possible to retrieve more relevant target tweets. Matching proper names between Arabic and English is a difficult task, because these two languages use different scripts. For that, we used an approach which projects the sounds of an English proper name into Arabic and aligns it with the most appropriate proper name. We evaluated the method with a classical measure and compared it to the one we developed. The experiments were carried out on two parallel corpora and show that our measure outperforms the baseline by 5.6% at R@1 recall.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION... (cscpconf)
Source and target word segmentation and alignment is a primary step in the statistical learning of transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for dealing with out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked whether this would be true for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected this syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method applies equally well to most written Indic languages); our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum entropy based transliteration learning system due to Kumaran and Kellner; our testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top 10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES (mlaij)
The methodology presented in this paper addresses the problem of multilingual speech
recognition. Current text and speech recognition and translation methods have a very low accuracy in
translating sentences which contain a mixture of two or more different languages. The paper proposes a
novel approach to tackling this problem and highlights some of the drawbacks of current recognition and
translation methods.
Improvement WSD dictionary using annotated corpus and testing it with simplif... (csandit)
WSD is a task with a long history in computational linguistics, and it remains an open problem in NLP. This research focuses on increasing the accuracy of the Lesk algorithm with the assistance of an annotated corpus, using Narodowy Korpus Jezyka Polskiego (NKJP, "Polish National Corpus"). The NKJP_WSI (NKJP Word Sense Inventory) is used as the sense inventory. The Lesk algorithm is first implemented on the whole corpus (training and test) to obtain the results. This is done with the assistance of a special dictionary that contains all possible senses for each ambiguous word. In this implementation, the similarity equation used in information retrieval is applied, using tf-idf with a small modification in order to meet the requirements. Experimental results show accuracies of 82.016% and 84.063% without and with deleting stop words, respectively. Moreover, this paper practically addresses the challenge of execution time: we propose a special structure for building another dictionary from the corpus in order to reduce the time complexity of the training process. The new dictionary contains all the possible words (only those which help in solving WSD) with their tf-idf from the existing dictionary, built with the assistance of the annotated corpus. Furthermore, experimental results show that the two tests are identical, while the execution time of the second test dropped by a factor of 20 compared to the first test with the same accuracy.
Continuous bag of words (CBOW) word2vec word embedding work.pdf (devangmittal4)
The way the continuous bag-of-words (CBOW) word2vec word embedding works is that it predicts the probability of a word given a context. A context may be a single word or a group of words. For simplicity, I will take a single context word and try to predict a single target word. The purpose of this question is to create a word embedding for the given data set; a sketch of one possible solution follows the data set text.
Data set text:
In linguistics word embeddings were discussed in the research area of distributional semantics. It
aims to quantify and categorize semantic similarities between linguistic items based on their
distributional properties in large samples of language data. The underlying idea that "a word is
characterized by the company it keeps" was popularized by Firth.
The technique of representing words as vectors has roots in the 1960s with the development of
the vector space model for information retrieval. Reducing the number of dimensions using
singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s. In 2000, Bengio et al. provided in a series of papers the "Neural probabilistic language models" to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words" (Bengio et al., 2003). Word embeddings come in two different
styles, one in which words are expressed as vectors of co-occurring words, and another in which
words are expressed as vectors of linguistic contexts in which the words occur; these different
styles are studied in (Lavelli et al., 2004). Roweis and Saul published in Science how to use
"locally linear embedding" (LLE) to discover representations of high dimensional data structures.
The area developed gradually and really took off after 2010, partly because important advances
had been made since then on the quality of vectors and the training speed of the model.
There are many branches and many research groups working on word embeddings. In 2013, a
team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit which can train
vector space models faster than the previous approaches. Most new word embedding techniques
rely on a neural network architecture instead of more traditional n-gram models and unsupervised
learning.
Limitations
One of the main limitations of word embeddings (word vector space models in general) is that
possible meanings of a word are conflated into a single representation (a single vector in the
semantic space). Sense embeddings are a solution to this problem: individual meanings of words
are represented as distinct vectors in the space.
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and Proteins) for
bioinformatics applications have been proposed by Asgari and Mofrad. Named bio-vectors
(BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins
(amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics.
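As a minimal sketch of one possible solution to the question above, the snippet below trains CBOW embeddings on an excerpt of the data set text using the gensim library; the variable names and hyperparameters are illustrative, not prescribed by the text.

import re
from gensim.models import Word2Vec

# A short excerpt of the data set text above; in practice, use the full text.
raw_text = ("In linguistics word embeddings were discussed in the research "
            "area of distributional semantics. It aims to quantify and "
            "categorize semantic similarities between linguistic items based "
            "on their distributional properties in large samples of language data.")

# One token list per sentence.
sentences = [re.findall(r"[a-z0-9]+", s.lower())
             for s in raw_text.split(".") if s.strip()]

# sg=0 selects the CBOW architecture: context words predict the target word.
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=0, epochs=100)

# Inspect the learned embedding space.
print(model.wv["semantic"][:5])
print(model.wv.most_similar("semantic", topn=3))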
USING OBJECTIVE WORDS IN THE REVIEWS TO IMPROVE THE COLLOQUIAL ARABIC SENTIME... (ijnlc)
One of the main difficulties in sentiment analysis of the Arabic language is the presence of colloquialism. In this paper, we examine the effect of using objective words in conjunction with sentimental words on sentiment classification for colloquial Arabic reviews, specifically Jordanian colloquial reviews. The reviews often include both sentimental and objective words; however, most existing sentiment analysis models ignore the objective words, as they are considered useless. In this work, we created two lexicons: the first includes the colloquial sentimental words and compound phrases, while the other contains the objective words associated with values of sentiment tendency based on a particular estimation method. We used these lexicons to extract sentiment features that serve as training input to Support Vector Machines (SVM) to classify the sentiment polarity of the reviews. The reviews dataset has been collected manually from the JEERAN website. The results of the experiments show that the proposed approach improves the polarity classification in comparison to two baseline models, with an accuracy of 95.6%.
A comparative analysis of particle swarm optimization and k-means algorithm f... (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly. As there is a huge collection of data on the web, there is a need for grouping (clustering) the documents into clusters for speedy information retrieval. Clustering of documents is the collection of documents into groups such that the documents within each group are similar to each other and not to documents of other groups. The quality of a clustering result depends greatly on the representation of text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO), and a hybrid PSO+K-means algorithm, for clustering of text documents using WordNet. The common way of representing a text document is as a bag of terms. The bag-of-terms representation is often unsatisfactory, as it does not exploit the semantics. In this paper, texts are represented in terms of the synsets corresponding to a word. The bag-of-terms representation of text is thus enriched with synonyms from WordNet. K-means, Particle Swarm Optimization (PSO), and hybrid PSO+K-means algorithms are applied for clustering of text in the Nepali language. Experimental evaluation is performed by using intra-cluster similarity and inter-cluster similarity.
A SELF-SUPERVISED TIBETAN-CHINESE VOCABULARY ALIGNMENT METHOD (IJwest)
Tibetan is a low-resource language. In order to alleviate the shortage of parallel corpora between Tibetan and Chinese, this paper uses two monolingual corpora and a small number of seed dictionaries to learn both a semi-supervised method with seed dictionaries and a self-supervised adversarial training method through the similarity calculation of word clusters in different embedding spaces, and puts forward an improved self-supervised adversarial learning method using Tibetan and Chinese monolingual data alignment only. The experimental results are as follows: the semi-supervised method with a seed dictionary achieved top-10 predicted-word accuracies of 66.5 (Tibetan-Chinese) and 74.8 (Chinese-Tibetan), while the improved self-supervised method reached an accuracy of 53.5 in both language directions.
International Journal on Web Service Computing (IJWSC), Vol.10, No.1/2/3, September 2019
DOI: 10.5121/ijwsc.2019.10302
LEARNING CROSS-LINGUAL WORD EMBEDDINGS
WITH UNIVERSAL CONCEPTS
Pezhman Sheinidashtegol and Aibek Musaev
Department of Computer Science, University of Alabama, Tuscaloosa, Alabama
ABSTRACT
Recent advances in generating monolingual word embeddings based on word co-occurrence for universal languages inspired new efforts to extend the model to support diversified languages. State-of-the-art methods for learning cross-lingual word embeddings rely on the alignment of monolingual word embedding spaces. Our goal is to implement word co-occurrence across languages using the universal concepts method. Such concepts are notions that are fundamental to humankind and are thus persistent across languages, e.g., a man or woman, war or peace, etc. Given bilingual lexicons, we built universal concepts as undirected graphs of connected nodes and then replaced the words belonging to the same graph with a unique graph ID. This intuitive design makes use of universal concepts in monolingual corpora, which helps generate meaningful word embeddings across languages via word co-occurrence. Standardized benchmarks demonstrate how this underutilized approach competes with SOTA on bilingual word semantic similarity and word similarity/relatedness tasks.
KEYWORDS
Word Embedding Model, NLP, Word Embedding Evaluation Tasks, Universal Concepts, and Bilingual and Cross-lingual Word Embeddings.
1. INTRODUCTION
Previous works have proposed to learn cross-lingual word embeddings by aligning monolingual
word embedding spaces. These approaches generally fall into two categories: supervised methods based on various transformations to align word embeddings in distinct languages, and unsupervised methods that do not use any parallel corpora. The first method using parallel data supervision was proposed by [15]: a linear mapping from a source to a target embedding space was learned and was shown to be effective for the translation of words between English and Spanish. Another method proposed to normalize the word vectors and constrain the linear mapping to be orthogonal [18]. Recent work by [6] introduced an unsupervised method to align monolingual word embedding spaces using cross-domain similarity local scaling (CSLS). The work by [12] leverages CSLS to improve the quality of bilingual word vector alignment.
In this paper, we propose an alternative approach to the described works that learns cross-lingual
word embeddings by injecting the seed universal concepts into the input monolingual texts. Such
concepts are notions that are fundamental to humankind and are thus persistent across languages,
e.g., the concept of a man or woman, war or peace, good or bad, etc. All words across languages
belonging to the same universal concept are replaced with a common unique identifier, such as
UUID [13]. The intuition is that the use of universal concepts in monolingual corpora from
different languages will help learn syntactic and semantic relationships for words across
languages via concept co-occurrence.
Note that this proposed approach results in a unified word embedding space for all languages.
This is in contrast with the existing works that rely on aligning monolingual word embeddings in
distinct languages. Hence, in this approach, there is no limit on the number of supported
languages. Each new language shares the universal concepts with other languages by adding
translation word pairs with any of the supported languages. We developed our approach
independently of Ammar et al., 2016 [1]. Two important distinctions warrant these contributions: a) the use of BFS for the generation of universal concepts, and b) an evaluation based on the standard MUSE datasets.
In what follows, we introduce our approach and evaluate it on the bilingual word semantic similarity and word similarity/relatedness tasks. We obtain robust bilingual embeddings for two language pairs, namely En-Es (English-Spanish) and En-De (English-German). Then, we generate a unified cross-lingual word embedding for all three languages and show that it does not degrade the quality of the generated word vectors. The code for our approach and the cross-lingual word vectors are available at https://github.com/aibek76/universal_concepts.
The rest of this paper is structured as follows. Section 2 briefly describes related publications.
Section 3 explains in detail the proposed methods. Section 4 describes the methods for evaluating
our generated word embeddings, and our findings. Finally, Section 5 presents our conclusions.
2. RELATED WORK
2.1. Word Representations
There are a number of effective existing approaches to creating monolingual word vectors. Two
related models, continuous bag-of-words (CBOW) and skip-gram [14] make up word2vec. Both
models make use of a context window surrounding the target word to adjust the weights of the
projection layer of the network. In the case of CBOW, the context words are used to predict the
current word, while skip-gram uses the current word to predict the other words in the context
window.
An extension of the skip-gram model under the name fastText was proposed by [5]. This model
makes use of sub-word information by means of n-grams. Vectors are learned for the various n-grams in the training text, and word vectors are composed by summing these n-gram vectors. The
aim of using this approach is to create vector representations which more effectively encode
important morphological information, such as word affixes.
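As a concrete illustration of this sub-word idea, here is a minimal sketch of the n-gram extraction (fastText additionally hashes these units into a fixed number of buckets; the function name is illustrative):

def char_ngrams(word, n_min=3, n_max=6):
    # fastText-style sub-word units: mark the word boundaries, then
    # collect all character n-grams of lengths n_min..n_max. The word's
    # vector is the sum of the vectors learned for these units.
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

# char_ngrams("fox") -> ['<fo', 'fox', 'ox>', '<fox', 'fox>', '<fox>']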
GloVe, a portmanteau of Global Vectors, makes use of both the local context of a given word and
global word co-occurrence counts [16]. Rather than solely considering the probability that an
individual word appears in the corpus at all, the model calculates probabilities that certain words
co-occur with the given target word. Words that are highly related to the target word will have a
high co-occurrence probability, while words that are unrelated will have a low co-occurrence
probability. The authors argue that by considering more global occurrence statistics, their model
more effectively captures semantic information than models whose primary focus is the
immediate context window.
2.2. Learning Multilingual Word Embeddings Via Alignment
In addition to the available approaches for creating monolingual word vectors, a number of
various methods exist for taking sets of monolingual word vectors and aligning their vector
spaces to create a new vector space that is multilingual. Methods proposed by [15], [12], and [2]
apply different functions to align the vector spaces. For each of these methods, it is required that
there is a small dictionary of bilingual data in order to begin the alignment. Another variation
proposed by [17] also uses a bilingual dictionary but additionally presents the possibility of a
pseudo-dictionary, which makes the assumption that all strings that are identically spelled in the
target and source languages hold similar meanings, and aligns the vector spaces using this pseudo-dictionary with considerable success.
A few methods have been proposed that lack the requirement for the grounding bilingual
dictionary and are therefore unsupervised. [8] suggests creating a two-agent communication game
wherein one agent only understands language A, while the other agent only understands language
B. A translation model translates sentences and sends them to each agent which confirms whether
the sentences they received make sense in their monitored language. This model may be extended
to many languages, so long as a closed loop is created. Another option is using iterative matching
as is done in [10,19].
2.3. Core Vocabulary
A concept that is key to supporting the validity of our approach is core vocabulary. The various
studies that focus on core vocabulary highlight the fact that a surprisingly small number of words
make up a very significant fraction of actual words used in speech and other forms of
communication. [4] studied the words used by preschoolers across 3,000 recorded speech samples
and found that the top 250 most frequently used words accounted for 85% of the full sample, and
furthermore found that the top 25 words accounted for 45% of the sample. A similar study was
performed with Australian adults in the workplace, and it found that 347 words accounted for 75%
of the total conversational sample [3]. The data from these studies validates the usefulness of
having a relatively small vocabulary for effective communication within a language. In our
proposed method, we use 5k words, which far exceeds the numbers provided in these studies and
therefore provides a robust, useful vocabulary for our multilingual vectors.
The word representation models described above are designed to create sets of monolingual vectors. Though numerous methods attempt to align the distinct vector spaces produced by these models, our method aims to eliminate the need for aligning separate vector spaces altogether. It may be implemented with any of the above word representation models to create a unified, multilingual vector space without mapping monolingual vector spaces onto one another. Lastly, this unified vector space is composed of a set of words that are highly useful in communication, since only a very small fraction of words makes up a significant percentage of the words actually used in communication.
3. PROPOSED APPROACH
3.1. Overview
The proposed approach consists of three steps, as follows:
1. Pre-processing step: prepare monolingual corpora
(a) Build universal concepts using bilingual lexicons
(b) Generate UUID for each universal concept
(c) Replace words from universal concepts with corresponding UUIDs
2. Processing step: compute word embeddings using modified corpora
(a) Compute word embeddings based on word co-occurrence
3. Post-processing step: retrieve vectors for words in universal concepts
(a) Assign vectors computed for UUIDs to all words in corresponding universal
concepts
During the pre-processing step, monolingual corpora are modified to encode the seed universal concepts. Then word embeddings based on word co-occurrence are computed using an existing implementation, such as word2vec or fastText. Finally, during the post-processing step, the vectors computed for universal concepts are assigned to each word belonging to those concepts.
1.(a) Build universal concepts using bilingual lexicons. We propose to use bilingual lexicons as
the seed universal concepts as follows. Each bilingual lexicon contains translations as word pairs
between words in the origin language and their translations in the target language. We treat each
word as a node and each word pair as an undirected edge or link between nodes. Then we build
universal concepts as undirected graphs of connected nodes using breadth first search (BFS). The
resulting graphs represent concepts that have the same meaning across languages.
A minimal Python sketch of this construction follows; the function and variable names are illustrative rather than taken from the original implementation:
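```python
from collections import defaultdict, deque

def build_universal_concepts(word_pairs):
    """Group translation pairs into connected components, each of which
    becomes one universal concept.

    word_pairs: iterable of (source_word, target_word) tuples taken from
    one or more bilingual lexicons.
    """
    graph = defaultdict(set)
    for src, tgt in word_pairs:
        graph[src].add(tgt)
        graph[tgt].add(src)

    concepts, visited = [], set()
    for start in graph:
        if start in visited:
            continue
        # Breadth-first search collects every word reachable from `start`.
        component, queue = set(), deque([start])
        visited.add(start)
        while queue:
            word = queue.popleft()
            component.add(word)
            for neighbor in graph[word]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        concepts.append(component)
    return concepts
```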
1.(b) Generate UUID for each universal concept. For each universal concept represented as a graph, we generate a unique ID, such as a UUID [13], and associate each word in that graph with the corresponding UUID.
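As a sketch, assuming the concept list produced by the previous step, the word-to-ID mapping might be built as follows (uuid4 is just one way to mint the IDs; the helper name is hypothetical):

```python
import uuid

def assign_concept_ids(concepts):
    """Map every word in a concept graph to that concept's unique ID."""
    word_to_uuid = {}
    for concept in concepts:
        concept_id = uuid.uuid4().hex  # one unique token per concept
        for word in concept:
            word_to_uuid[word] = concept_id
    return word_to_uuid
```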
1.(c) Replace words from universal concepts with corresponding UUIDs. Given the downloaded monolingual Wikipedia corpora in English and Spanish, we combine them into a single corpus. Then we replace each mention of a word from a universal concept with the corresponding UUID. This ensures that words belonging to the same concept are treated in the same manner across languages.
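A minimal sketch of this substitution, assuming a whitespace-tokenized corpus and the word_to_uuid mapping from the previous step (both assumptions, not details given in the paper):

```python
def encode_corpus(lines, word_to_uuid):
    """Replace every occurrence of a concept word with its UUID token."""
    for line in lines:
        tokens = line.split()  # assumes whitespace-tokenized text
        yield " ".join(word_to_uuid.get(tok, tok) for tok in tokens)
```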
2.(a) Compute word embeddings based on word co-occurrence. In this step the combined corpus with encoded universal concepts is used as input for computing word embeddings based on word co-occurrence. Note that any implementation of the word co-occurrence approach, such as word2vec or fastText, can be applied.
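For illustration, gensim's Word2Vec is one off-the-shelf implementation that could be plugged in here; the hyperparameter values below are assumptions, except for the window of 10 mentioned with Table 1:

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens from the UUID-encoded combined corpus
# (e.g., the output of encode_corpus from the earlier sketch).
sentences = [line.split() for line in encoded_lines]

# Skip-gram over the encoded corpus; the window of 10 matches the setting
# reported with Table 1, while the other values are illustrative defaults.
model = Word2Vec(sentences, vector_size=300, window=10, min_count=5, sg=1)
```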
3.(a) Retrieve vectors for words in universal concepts. The vectors computed for UUIDs are assigned to each word in the corresponding universal concepts. The resulting word embeddings contain vectors for the words in all languages.
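A sketch of this expansion, assuming the gensim model and the word_to_uuid mapping from the earlier sketches:

```python
def expand_to_words(model, word_to_uuid):
    """Give every concept word the vector learned for its UUID token."""
    vectors = {}
    for word, concept_id in word_to_uuid.items():
        if concept_id in model.wv:  # skip concepts pruned by min_count
            vectors[word] = model.wv[concept_id]
    return vectors
```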
3.2. Illustration of the approach
Here is an example of the input texts in three languages. Note that we use abbreviations instead of the actual UUID values for clarity:
• English: brown fox jumps over the lazy dog
• German: brauner fuchs springt über den faulen hund
• Spanish: zorro marrón salta sobre el perro perezoso
Most of these words are among the 35k words of the full MUSE dataset [6], which was used to generate the universal concepts. The only concept missing from the training set is lazy, which is why its surface forms (lazy, faulen, and perezoso) remain unreplaced. The pre-processing step 1 based on this training set will generate the following modified texts:
• English: uuid1 uuid2 uuid3 uuid4 uuid5 lazy uuid6
• German: uuid1 uuid2 uuid3 uuid4 uuid5 faulen uuid6
• Spanish: uuid2 uuid1 uuid3 uuid4 uuid5 uuid6 perezoso
The modified texts are used as input to the processing step 2, which learns the word embeddings for all tokens, including universal concepts and the remaining words. The post-processing step 3 assigns the vectors computed for universal concepts to each word belonging to those concepts, e.g., the same vector will be assigned to the words fox, fuchs, and zorro as they share the same UUID.
4. EXPERIMENTS
4.1. Overview
We tested our model using bilingual word semantic similarity and word semantic relatedness tasks. Both training and evaluation sets are from the MUSE benchmark [6]. To start, we used two pairs of bilingual training sets (En-Es and Es-En, De-En and En-De), such that each set had 35k origin words, to build the bilingual universal concepts word embeddings (BUCWE). Then, we evaluated the performance of these training sets on a bilingual word semantic similarity task and a word semantic relatedness task. Finally, we combined the two sets of bilingual dictionaries with the 35k origin words to build a unified multilingual universal concepts model supporting all three languages (MUCWE) and evaluated it with the word similarity task.
Table 1: Comparison of bilingual word semantic similarity performance between RCSLS [11] and bilingual universal concepts word embeddings (BUCWE), both with a window of 10.

Method       En-Es   Es-En   En-De   De-En
RCSLS [11]   51.54   46.77   54.86   11.15
BUCWE        40.95   53.68   52.53   33.53

Table 2: Performance on word similarity for German and Spanish, comparing the original monolingual embeddings with BUCWE.

Dataset     Monolingual   BUCWE
De-GUR350   0.65          0.67
Es-WS353    0.54          0.60
4.2. Bilingual Word Semantic Similarity Task
In this task, we used the MUSE test set to evaluate our bilingual universal concepts word embeddings (BUCWE) and the bilingual word embeddings provided by the MUSE project. First, we computed a list of the ten most similar words for each word in the origin language. These words are retrieved by semantic word similarity, defined as the cosine similarity between word vectors. If the translation of the word was in the list, it passed the test; otherwise it failed. Finally, the number of word pairs that passed the test divided by the number of all tested pairs gives the score for this task. This is a variation of the word translation task used by Mikolov et al. [14]. As shown in Table 1, our proposed model consistently performs better on XX-En tests. In contrast, RCSLS achieves better results on En-XX tests.
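As a rough sketch of this scoring procedure (precision at the ten nearest neighbors by cosine similarity; the function and argument names are hypothetical, and vectors is assumed to map words to numpy arrays):

```python
import numpy as np

def precision_at_10(test_pairs, vectors, target_vocab):
    """Fraction of (source, translation) pairs whose translation appears
    among the ten nearest target-language words by cosine similarity."""
    words = sorted(target_vocab)
    matrix = np.stack([vectors[w] for w in words])
    # L2-normalize rows so that dot products equal cosine similarities.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

    hits = 0
    for source_word, translation in test_pairs:
        query = vectors[source_word]
        query = query / np.linalg.norm(query)
        nearest = np.argsort(matrix @ query)[-10:]
        if translation in {words[i] for i in nearest}:
            hits += 1
    return hits / len(test_pairs)
```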
4.3. Word Semantic Relatedness Task
In this experiment, we measure whether the semantics of word vectors are maintained within languages. We used the GUR350 dataset for German and WS353 for Spanish. Each row of these datasets contains a pair of words and a human-judged relatedness score. To test our word vectors, we first calculate the cosine similarity between the vectors of each given word pair. Next, we compute the Spearman correlation coefficient between the computed cosine similarities and the human judgment scores. A higher coefficient indicates higher-quality word vectors. Table 2 shows that our approach does not have a negative impact on the quality of the generated word vectors.
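A minimal sketch of this evaluation using scipy's Spearman correlation (the dataset rows and the vectors mapping are assumed inputs, not artifacts from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

def relatedness_score(dataset_rows, vectors):
    """Spearman correlation between cosine similarities and human scores.

    dataset_rows: iterable of (word1, word2, human_score) triples, e.g.,
    from GUR350 or WS353.
    """
    cosines, human = [], []
    for w1, w2, score in dataset_rows:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            v1, v2 = vectors[w1], vectors[w2]
            cosines.append(np.dot(v1, v2) /
                           (np.linalg.norm(v1) * np.linalg.norm(v2)))
            human.append(score)
    return spearmanr(cosines, human).correlation
```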
4.4. Multilingual Universal Concepts Word Embeddings (MUCWE)
Next, we ran the word semantic relatedness task for MUCWE to determine the impact of adding more languages to the universal concepts model, based on word similarity within each language. Table 3 shows that MUCWE, consisting of three languages, represents the relatedness of words within each language as well as the monolingual word2vec word vectors do.
Table 3: Comparing word similarity performance between the multilingual universal concepts word embeddings built from 35k words for English, Spanish, and German (MUCWE) and monolingual word2vec word vectors.

Dataset          Es-WS353   De-GUR350
Es-monolingual   0.54       -
De-monolingual   -          0.65
MUCWE            0.54       0.63
5. CONCLUSIONS
This paper introduces the learning of cross-lingual word embeddings based on the universal concepts approach. Our results show that cross-lingual word embeddings seeded with universal concepts are competitive on bilingual word semantic similarity and word semantic relatedness tasks. We also show that the proposed approach does not degrade the quality of the generated word vectors. Finally, our method features an extensibility that is lacking in most state-of-the-art methods.
Our future work includes the support of additional languages, both Romance and Asian, in a
single model. We also plan to perform a study of the core set of universal concepts across
languages through statistical analysis. We are also developing a new intuitive intrinsic evaluation task for word embeddings.
ACKNOWLEDGEMENTS
I would like to thank the Computer Science Department at The University of Alabama; through their funding and support, the faculty and staff made it possible for me to work on this very interesting subject.
REFERENCES
[1] Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A.
Smith. Massively multilingual word embeddings. CoRR, abs/1602.01925, 2016.
[2] Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Learning principled bilingual mappings of word
embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing, pages 2289–2294, Austin, Texas, November
2016. Association for Computational Linguistics.
[3] Susan Balandin and Teresa Iacono. Crews, wusses, and whoppas: core and fringe vocabularies of Australian meal-break conversations in the workplace. Augmentative and Alternative Communication, 15(2):95–109, 1999.
[4] David R. Beukelman, Rebecca S. Jones, and Mary Rowan. Frequency of word usage by nondisabled
peers in integrated preschool classrooms. AAC: Augmentative and Alternative Communication,
5(4):243–248, 1 1989.
[5] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. CoRR, abs/1607.04606, 2016.
[6] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. CoRR, abs/1710.04087, 2017.
[7] Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised alignment of embeddings with Wasserstein Procrustes. CoRR, abs/1805.11222, 2018.
[8] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual
learning for machine translation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett, editors, Advances in Neural Information Processing Systems 29, pages 820–828. Curran
Associates, Inc., 2016.
[9] Yedid Hoshen and Lior Wolf. An iterative closest point method for unsupervised word translation. CoRR, abs/1801.06126, 2018.
[10] Yedid Hoshen and Lior Wolf. Non-adversarial unsupervised word translation. In Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing, pages 469–478.
Association for Computational Linguistics, 2018.
[11] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, and Edouard Grave. Improving supervised
bilingual mapping of word embeddings. CoRR, abs/1804.07745, 2018.
[12] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2979–2984, 2018.
[13] Paul J. Leach, Michael Mealling, and Rich Salz. A universally unique identifier (UUID) URN namespace. RFC, 4122:1–32, 2005.
[14] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[15] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168, 2013.
[16] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[17] Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. CoRR, abs/1702.03859, 2017.
[18] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1006–1011, 2015.
[19] Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Adversarial training for unsupervised
bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 1959–1970. Association for
Computational Linguistics, 2017.
AUTHORS
Pezhman Sheinidashtegol received his B.S. in Computer Engineering from IAU,
in 2010 and his M.S. degree in Computer Science from Western Kentucky
University, in 2016. He is currently pursuing his Ph.D. in the department of
Computer Science at The University of Alabama. His research interests include
deep learning, text mining, machine learning, social media, data privacy, and
cloud/network security.
Dr. Musaev is an Assistant Professor of Computer Science at the University of
Alabama. He received his Ph.D. (2016), M.S. (2000) and B.S. (1999) degrees in
Computer Science from Georgia Tech. Before doctoral studies, he founded and
managed a software company Akforta that provided enterprise management
solutions. He is a recipient of the Scholarship of the President of the Kyrgyz
Republic from 1997 to 1999. His primary research interests are in applied data
mining of big data, including social media for disaster management and video
traffic data for transportation.