Word2vec on the Italian language: first experiments
Vincenzo Lomonaco
Alma Mater Studiorum - University of Bologna
February 19, 2015
Abstract
The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent
years. The vector representations of words learned by word2vec models have been shown to carry semantic
meaning and to be useful in various NLP tasks. In this work I try to reproduce the previously obtained
results for the English language and to explore the possibility of doing the same for the Italian language.
1 Introduction
Many current NLP systems and techniques treat words as atomic units: there is no notion of similarity
between words, as these are represented as indices in a vocabulary. This choice has several good reasons:
simplicity, robustness and the observation that simple models trained on huge amounts of data outperform
complex systems trained on less data. An example is the popular N-gram model used for statistical language
modeling. However, the simple techniques are at their limits in many tasks. With the progress of machine
learning techniques in recent years, it has become possible to train more complex models on much larger
data sets, and they now typically outperform the simple models. Probably one of the most successful
concepts is the use of distributed representations of words [2]. For example, neural network based language
models significantly outperform N-gram models in many cases [1], [4], [8]. The word2vec tool was born out
of this trend. It can be used for learning high-quality word vectors from huge data sets with billions of
words, and with millions of words in the vocabulary. As far as I know, none of the previously proposed
architectures has been successfully trained on more than a few hundred million words, with a modest
dimensionality of the word vectors between 50 and 100. The main goal of this work is to validate previously
proposed experiments for the English language (especially exploring how this tool performs on smaller data
sets) and then to figure out whether it is possible to reproduce the same accuracy and performance with
the Italian language. In section 2, the architectures proposed by word2vec are briefly summarized. In section 3, I
present the corpora, the preprocessing and the test sets used. Then, in section 4, I explain in detail which
experiments were performed and the results obtained. Lastly, in section 5, I draw the main conclusions.
2 Word2vec models
Many different types of models have been proposed for estimating continuous representations of words, including
the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Word2vec computes
distributed representations of words using neural networks, as it was previously shown that they perform
significantly better than LSA at preserving linear regularities among words [6], [9] and that they are
computationally cheaper than LDA on large data sets. Practically speaking, word2vec proposes two new model
architectures for learning distributed representations of words that try to minimize computational
complexity. The first one is called Continuous Bag-of-Words (CBOW) and is quite similar to the feedforward
Neural Net Language Model (NNLM), except that the non-linear hidden layer is removed and the projection layer
is shared for all words. This architecture is called Continuous Bag-of-Words because the order of words in the
history does not influence the projection. Furthermore, words from the future are also used. The best performance
in the original work was obtained on the task introduced in the next section by building a log-linear classifier
with four future and four history words at the input, where the training criterion is to correctly classify the
current (middle) word. The second architecture, called Continuous Skip-gram (Skip-gram), is similar to CBOW,
but instead of predicting the current word based on the context, it tries to maximize the classification of a
word based on another word in the same sentence. More precisely, each current word is used
as an input to a log-linear classifier with a continuous projection layer to predict words within a certain range
before and after the current word. The original work found that increasing the range improves the quality of the
resulting word vectors, but it also increases the computational complexity.
Figure 1: New model architectures. The CBOW architecture predicts the current word based on the context,
and the Skip-gram predicts surrounding words given the current word.
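To make the difference between the two architectures concrete, the following minimal Python sketch shows how training examples could be derived from a tokenized sentence in each case. It is only an illustration: the function names, the toy sentence and the fixed symmetric window are assumptions of this sketch, and it ignores details of the actual tool such as how the window is chosen for each position.

```python
def cbow_examples(tokens, window=4):
    """Yield (context_words, target_word): the context predicts the middle word."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target


def skipgram_examples(tokens, window=4):
    """Yield (input_word, output_word): the current word predicts each neighbour."""
    for i, current in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield current, tokens[j]


# Hypothetical toy sentence, used only for illustration.
sentence = "il gatto dorme sul divano".split()
print(list(cbow_examples(sentence, window=1)))
print(list(skipgram_examples(sentence, window=1)))
```

With the same window, Skip-gram produces one example per (word, neighbour) pair, whereas CBOW produces one example per position, which is one way to see why a larger range increases the training cost.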
3 Corpora and test sets
Due to computational and memory limits, I was forced to consider only small data sets and to compute
word vectors on them. In order to highlight the correlation between the data set size and the quality of the
word vectors, I decided to prepare two data sets for each language: the former of 100MB and the latter
of 200MB. For the English language I chose to use a 200MB chunk of the “One Billion Word Language
Modeling Benchmark” (1BWLMB), a tokenized corpus provided by Google. I further considered the small, sampled
and already preprocessed version of the Wikipedia dump corpus (text8) that ships with word2vec as a demo
data set, in order to make some comparisons. For the Italian language I chose to sample directly from the plain-text
ItWaC corpus (which counts more than 2 billion words) and reduce it to our demo size of 200MB. Table 1
provides further information about each data set.
Table 1: Data sets summary
Language  Name    Size   Vocabulary size  Number of words  Encoding
Eng       1BWLMB  100MB  60745            18037497         utf-8
Eng       text8   100MB  71291            16718843         utf-8
Eng       1BWLMB  200MB  81746            34756679         utf-8
Ita       ItWaC   100MB  90486            16691286         utf-8
Ita       ItWaC   200MB  125625           33394879         utf-8
Before feeding the text to word2vec, some preprocessing is needed. In fact, word2vec performs best
when:
• Punctuation and special characters are removed
• Words are converted to lowercase
• Numerals are converted to their word forms (e.g. 1996 becomes one nine nine six)
Although these preprocessing steps are not strictly necessary, they can improve the accuracy and be useful for
some kinds of applications. In this work I decided to remove only punctuation and special characters. Thus,
I wrote some Python scripts in order to preprocess all the data in the same way.
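The scripts themselves are not reproduced here. As a rough idea of what such a step can look like, the sketch below strips punctuation and special characters from a line while leaving case and numerals untouched. The function name and the choice of replacing each removed character with a space (so that "L'Italia" becomes two tokens) are assumptions of this sketch, not a description of the scripts actually used.

```python
import re

# Anything that is not a Unicode letter, digit or whitespace becomes a space.
# "\w" also matches "_", so the underscore is listed explicitly for removal.
_NON_WORD = re.compile(r"[^\w\s]|_")
_SPACES = re.compile(r"\s+")


def clean_line(line):
    """Remove punctuation and special characters; keep case, digits and accents."""
    without_punct = _NON_WORD.sub(" ", line)
    return _SPACES.sub(" ", without_punct).strip()


print(clean_line("L'Italia, nel 1996: che bello!"))
```

Because Python 3 regular expressions are Unicode-aware for `str` patterns, accented Italian letters such as "é" are treated as word characters and are preserved.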
With respect to the test sets, the word2vec team provided a specific set of “questions” to evaluate the
accuracy of the word vectors. Although it is easy to show that the word France is similar to Italy and perhaps some
other countries, it is much more challenging to subject those vectors to a more complex similarity
task. The authors follow previous observations that there can be many different types of similarities between
words: for example, the word big is similar to bigger in the same sense that small is similar to smaller. On these
premises they denote two pairs of words with the same relationship as a question, as we can ask: “What
is the word that is similar to small in the same sense as biggest is similar to big?” Somewhat surprisingly,
these questions can be answered by performing simple algebraic operations on the vector representations
of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can
simply compute vector X = vector(“biggest”) - vector(“big”) + vector(“small”). Then, the vector space is
searched for the word closest to X, measured by cosine distance, and that word is used as the answer to the question.
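The vector arithmetic described above can be sketched in a few lines of Python. The two-dimensional toy vectors are invented purely for illustration, and excluding the three question words from the candidates is an assumption of this sketch; real word vectors have hundreds of dimensions and come from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_analogy(vectors, a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c)."""
    x = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption: the question words themselves are not valid answers.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(x, vectors[w]))


# Invented toy vectors: first axis ~ size, second axis ~ superlative form.
toy_vectors = {
    "big": [1.0, 0.0],
    "biggest": [1.0, 1.0],
    "small": [-1.0, 0.0],
    "smallest": [-1.0, 1.0],
    "tiny": [-1.0, -0.2],
}
print(answer_analogy(toy_vectors, "big", "biggest", "small"))
```

Maximizing cosine similarity is equivalent to minimizing cosine distance, so the word returned is the one the text refers to as "closest to X".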
Thus, to measure the quality of the word vectors, a comprehensive test set that contains five types of
semantic questions and nine types of syntactic questions was defined. Two examples from each category are
shown in Figure 2. Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each
category were created in two steps: first, a list of similar word pairs was created manually. Then, a large
list of questions was formed by connecting two word pairs. For example, the authors made a list of 68 large
American cities and the states they belong to, and formed about 2.5K questions by picking two word pairs
at random. Only single-token words are included, thus multi-word entities (such as New
York) are not present. The accuracy is then evaluated overall, across all question types, and separately for
each question type (semantic, syntactic). A question is assumed to be correctly answered only if the closest word to the vector
computed using the above method is exactly the same as the correct word in the question; synonyms are
thus counted as mistakes. This also means that reaching 100% accuracy is likely to be impossible, as the
current models do not have any input information about word morphology.
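The exact-match scoring rule can be written down as a short Python sketch. The function below is a hypothetical helper, not the evaluation code shipped with word2vec: it takes any `predict` callable that returns a single word, and it assumes that questions containing a word unknown to the model are skipped rather than counted as errors.

```python
def evaluate(questions, vocabulary, predict):
    """Score 4-word questions (a, b, c, expected) with exact-match TOP-1 accuracy."""
    used = 0
    correct = 0
    for a, b, c, expected in questions:
        if not all(word in vocabulary for word in (a, b, c, expected)):
            continue  # assumption: questions with unknown words are skipped
        used += 1
        if predict(a, b, c) == expected:  # exact match only: synonyms are errors
            correct += 1
    accuracy = correct / used if used else 0.0
    return correct, used, accuracy


# Tiny illustrative run with a dummy predictor that always answers "smaller".
questions = [
    ("big", "bigger", "small", "smaller"),
    ("big", "bigger", "small", "tinier"),
    ("big", "bigger", "huge", "huger"),
]
vocabulary = {"big", "bigger", "small", "smaller", "tinier"}
print(evaluate(questions, vocabulary, lambda a, b, c: "smaller"))
```

In this toy run the second question is counted as a mistake even though the predicted word is arguably a reasonable answer, which is exactly why 100% accuracy is hard to reach with this metric.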
Figure 2: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic
Word Relationship test set for the English language.
The same test set has been used in this work, manually translating the original word pairs into
Italian and then automatically forming questions by connecting two word pairs in all their possible
combinations. However, it is worth saying that switching from one language to another can sometimes lead
to meaningless questions. Some words, in fact, are more common than others or cannot be translated as a
single word in the other language. Every effort has been made to maintain the correspondence while
preserving the meaning of the test, and in the following sections we will try to assess how well this
approach works.
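Forming questions from a list of word pairs can be sketched as follows. The sketch assumes that "all possible combinations" means every ordered pair of two distinct word pairs, which gives n(n-1) questions for n pairs; this reading is consistent with sections such as capital-common-countries, where 23 pairs would yield the 506 questions reported for the English test set. The Italian pairs shown are illustrative examples only.

```python
from itertools import permutations


def build_questions(word_pairs):
    """Connect every ordered pair of distinct word pairs into a 4-word question."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(word_pairs, 2)]


# Illustrative translated pairs (capital, country).
pairs = [("roma", "italia"), ("parigi", "francia"), ("berlino", "germania")]
for question in build_questions(pairs):
    print(" ".join(question))
```

Each printed line has the form "a b c d", to be read as "a is to b as c is to d", with d being the word the model is expected to return.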
4 Experiments and results
In this section we present two main experiments performed in parallel on the English and Italian sampled
corpora. The first one measures the accuracy obtained on the test sets with different data sets, and the
second one is an exploratory analysis of an unsupervised word clustering attempt.
4.1 Test sets experiments
All the experiments were performed using the Skip-gram model and a vector size of 200. For the other
options the default word2vec values were chosen. Tab. 2 shows the results reported directly by the
original compute-accuracy code for the English language on the demo text8 data set. There are about 19.5K
possible questions, but only about 63% of them were used, since the tool skips questions containing words that
are not in its vocabulary. An average accuracy of 26% is reached with the TOP-1 metric, meaning that an answer
counts as correct only if it is ranked first among all the guesses. This is
an acceptable result given that all synonyms are considered wrong and the data set is very small.
Table 2: English text8 accuracy
Category 110MB-TOP1
capital-common-countries 37.35% (189 / 506)
capital-world 23.35% (339 / 1452)
currency 6.34% (17 / 268)
city-in-state 19.22% (302 / 1571)
family 52.94% (162 / 306)
gram1-adjective-to-adverb 5.16% (39 / 756)
gram2-opposite 13.07% (40 / 306)
gram3-comparative 31.43% (396 / 1260)
gram4-superlative 12.45% (63 / 506)
gram5-present-participle 14.11% (140 / 992)
gram6-nationality-adjective 56.24% (771 / 1371)
gram7-past-tense 18.17% (242 / 1332)
gram8-plural 40.12% (398 / 992)
gram9-plural-verbs 17.38% (113 / 650)
Average: 26.17% (3211/12268)
Questions used / total: 62.77% (12268/19544)
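The two summary rows of Table 2 can be recomputed from the per-category counts. The short check below copies the (correct, used) pairs from the table and treats the average as total correct answers over total questions used, rather than a mean of the per-category percentages; the variable names are chosen only for this sketch.

```python
# (correct, used) counts copied from Table 2 (English text8, TOP-1).
table2 = {
    "capital-common-countries": (189, 506),
    "capital-world": (339, 1452),
    "currency": (17, 268),
    "city-in-state": (302, 1571),
    "family": (162, 306),
    "gram1-adjective-to-adverb": (39, 756),
    "gram2-opposite": (40, 306),
    "gram3-comparative": (396, 1260),
    "gram4-superlative": (63, 506),
    "gram5-present-participle": (140, 992),
    "gram6-nationality-adjective": (771, 1371),
    "gram7-past-tense": (242, 1332),
    "gram8-plural": (398, 992),
    "gram9-plural-verbs": (113, 650),
}
total_questions = 8869 + 10675  # semantic + syntactic questions in the full test set

correct = sum(c for c, _ in table2.values())
used = sum(u for _, u in table2.values())
print(f"Average: {100 * correct / used:.2f}% ({correct}/{used})")
print(f"Questions used / total: {100 * used / total_questions:.2f}% ({used}/{total_questions})")
```

Because the average is weighted by the number of questions used in each category, large sections such as city-in-state and capital-world influence it more than small ones such as currency.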
Let us now consider the results for the “One Billion Word Language Modeling Benchmark” in Tab. 3.
According to the table, results are somewhat worse on the 100MB sample of this data set, which reaches an
average accuracy of 19.77%. With the 200MB sample, however, the accuracy jumps to 29.98%. This is a huge
improvement considering that only another 100MB chunk of data was added.
Table 3: English 1BWLMB accuracy
Category 100MB-TOP1 200MB-TOP1
capital-common-countries 32.81% (166 / 506) 46.64% (236 / 506)
capital-world 20.37% (355 / 1743) 35.20% (598 / 1699)
currency 1.56% (2 / 128) 6.48% (7 / 108)
city-in-state 9.18% (184 / 2005) 13.34% (267 / 2001)
family 52.11% (198 / 380) 54.21% (206 / 380)
gram1-adjective-to-adverb 1.72% (16 / 930) 3.01% (28 / 930)
gram2-opposite 5.53% (28 / 506) 6.06% (28 / 462)
gram3-comparative 38.06% (507 / 1332) 52.25% (696 / 1332)
gram4-superlative 11.95% (97 / 812) 20.57% (167 / 812)
gram5-present-participle 18.35% (182 / 992) 28.83% (286 / 992)
gram6-nationality-adjective 30.11% (370 / 1229) 47.76% (554 / 1160)
gram7-past-tense 22.95% (358 / 1560) 35.38% (552 / 1560)
gram8-plural 16.29% (172 / 1056) 28.25% (317 / 1122)
gram9-plural-verbs 18.46% (120 / 650) 26.15% (170 / 650)
Average: 19.77% (2735/13829) 29.98% (4112/13714)
Questions used / total: 70.76% (13829/19544) 70.17% (13714/19544)
Moving to the Italian language, the accuracy results are given in Tab. 4. First of all, the third grammar
section (gram3-comparative) was removed from the test set, since it was impossible to translate the comparative
as a single word in Italian. With respect to the accuracy, a huge drop can be seen in basically every test section
when comparing the accuracy with the corresponding English results. The main reasons for this drop concern the
original corpora and the sampled data sets. Consider, for example, the city-in-state section, which drops from 19% to
nearly 1%. This section is based on questions about US states and cities. Clearly,
Wikipedia and the Google News corpus contain many more entities of this kind than the ItWaC
corpus, which was built by crawling random .it domains and is not as clean as the other two. The same can be
said for the other sections, especially the non-syntactic ones. However, even in some sections in which a better
accuracy was expected (such as the plural section) we see a steep drop. There are many possible
explanations for this. First of all, Italian, like other European languages, is morphologically
richer than English, implying that more data are needed to reach the same accuracy. Moreover, the words selected
for the questions are common in English, and the same cannot necessarily be said of their
Italian translations. This may imply a large difference in accuracy on small data sets, especially
with a neural network model. To have the last word, however, a larger number of experiments has to be done,
varying the size of the data sets up to the state-of-the-art data set size and using a more ad hoc
benchmark for evaluating the quality of word vectors in Italian. Nevertheless, in this case too, increasing the
size of the data set leads to a good improvement in accuracy in almost all the test sections, with an
overall accuracy improvement of a few percentage points.
Table 4: Italian ItWaC accuracy
Category 100MB-TOP1 200MB-TOP1
capital-common-countries 8.01% (37 / 462) 15.37% (71 / 462)
capital-world 4.56% (32 / 702) 8.51% (74 / 870)
currency 0.00% (0 / 156) 0.00% (0 / 182)
city-in-state 1.19% (9 / 756) 3.63% (36 / 992)
family 6.76% (25 / 370) 9.90% (49 / 495)
gram1-adjective-to-adverb 1.03% (9 / 870) 3.68% (32 / 870)
gram2-opposite 0.15% (1 / 650) 2.85% (20 / 702)
gram4-superlative 0.71% (5 / 702) 3.98% (37 / 930)
gram5-present-participle 6.99% (65 / 930) 12.90% (128 / 992)
gram6-nationality-adjective 2.06% (26 / 1260) 5.00% (78 / 1560)
gram7-past-tense 1.25% (14 / 1122) 3.68% (49 / 1332)
gram8-plural 3.49% (44 / 1260) 5.16% (65 / 1260)
gram9-plural-verbs 16.32% (142 / 870) 28.85% (251 / 870)
Average: 4.12% (417/10110) 7.73% (890/11517)
Questions used / total: 71.08% (10110/14223) 80.97% (11517/14223)
4.2 Clustering experiments
The word vectors can also be used for deriving word classes from huge data sets. This is achieved by
performing K-means clustering on top of the word vectors. The original word2vec release contains a
plain C implementation and a shell script that demonstrates its use. The output is a vocabulary file with
words and their corresponding class IDs. Some examples for both languages are provided in Tab. 5 and
Tab. 6 (note that there is no correlation between words of different languages: they are put together only
for formatting purposes).
Even if the user can set a specific number of output classes, semantic and syntactic similarities are difficult
to separate. This could be good for some applications and less good for others. From a brief exploratory
analysis, the quality of the classes seems to be similar for both languages, but only a stricter test with the help
of a wordnet could validate this claim.
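For readers unfamiliar with the procedure, the following is a minimal, generic K-means sketch in plain Python that maps each word to a class ID. It is not the C implementation shipped with word2vec: the deterministic round-robin initialization, the squared Euclidean distance, the fixed number of iterations and the toy two-dimensional vectors are all assumptions made for illustration.

```python
def kmeans(vectors, k, iterations=10):
    """Plain K-means on a {word: vector} dict; returns {word: class_id}."""
    words = list(vectors)
    dim = len(vectors[words[0]])
    # Deterministic start: word i goes to class i % k (arbitrary choice for this sketch).
    classes = {w: i % k for i, w in enumerate(words)}
    for _ in range(iterations):
        # Update step: each centroid is the mean of the vectors currently in its class.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for w in words:
            counts[classes[w]] += 1
            for d, value in enumerate(vectors[w]):
                sums[classes[w]][d] += value
        centroids = [
            [s / counts[c] for s in sums[c]] if counts[c] else sums[c]
            for c in range(k)
        ]
        # Assignment step: move every word to its nearest centroid.
        for w in words:
            classes[w] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(vectors[w], centroids[c])),
            )
    return classes


# Invented toy vectors: the first three words lie near one axis, the last two near the other.
toy = {
    "credito": [0.9, 0.1],
    "denaro": [1.0, 0.0],
    "debitore": [0.8, 0.2],
    "coyotes": [0.1, 0.9],
    "crocodile": [0.0, 1.0],
}
for word, class_id in kmeans(toy, k=2).items():
    print(word, class_id)
```

The output has the same shape as the vocabulary file described above: one word per line followed by its class ID.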
Table 5: Clustering Examples
English word  Class ID  Italian word  Class ID
carnivores    234       bancario      10
carnivorous   234       bonifico      10
cetaceans     234       cambiale      10
cormorant     234       cambiali      10
coyotes       234       cambiari      10
crocodile     234       correntista   10
crocodiles    234       costitutore   10
crustaceans   234       credito       10
cultivated    234       debitore      10
danios        234       denaro        10
Table 6: Clustering Examples
English word  Class ID  Italian word    Class ID
acceptance    412       menzogneri      341
argue         412       minacciare      341
argues        412       minando         341
arguing       412       minato          341
argument      412       mistificazione  341
arguments     412       nefasta         341
belief        412       opponendo       341
believe       412       opponendosi     341
challenge     412       oppressa        341
claim         412       oppressore      341
5 Conclusion
In this work we have tried to understand word2vec, a well-known tool for learning high-quality word vectors,
and to reproduce to some extent the results obtained in the original work for the English language. Moreover,
we aimed to start bringing the same experiments to the Italian language. Using very
different corpora of limited size, as well as directly translating the test set without an accurate linguistic
revision, has led to a very low accuracy level for the Italian language. Increasing the number of words up to
billions, and the size of the vocabulary accordingly, could certainly raise the overall level of accuracy. The next
step, however, would be to construct a new test set from scratch and use it as a benchmark for trying all the most
widely used word space models in the context of the Italian language. In the end, I am confident that, with appropriate
effort, it would be possible to use word2vec and its different parallel implementations in the same way as they
are used for the English language, reaching the state of the art in terms of both performance and accuracy.
References
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language
model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[2] James L McClelland, David E Rumelhart, PDP Research Group, et al. Parallel distributed processing.
Explorations in the microstructure of cognition, 2:216–271, 1986.
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations
in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Černocký. Empirical evaluation
and combination of advanced language modeling techniques. In INTERSPEECH, number s 1, pages 605–
608, 2011.
[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations
of words and phrases and their compositionality. In Advances in Neural Information Processing Systems,
pages 3111–3119, 2013.
[6] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word
representations. In HLT-NAACL, pages 746–751, 2013.
[7] Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.
[8] Holger Schwenk. Continuous space language models. Computer Speech & Language, 21(3):492–518, 2007.
[9] Alisa Zhila, Wen-tau Yih, Christopher Meek, Geoffrey Zweig, and Tomas Mikolov. Combining heterogeneous
models for measuring relational similarity. In HLT-NAACL, pages 1000–1009, 2013.
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural NetworkSentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Networkkevig
 
A neural probabilistic language model
A neural probabilistic language modelA neural probabilistic language model
A neural probabilistic language modelc sharada
 

Similar to Word2vec models on Italian language experiments (20)

Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERT
 
PDFTextProcessing
PDFTextProcessingPDFTextProcessing
PDFTextProcessing
 
[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data science
 
New word analogy corpus
New word analogy corpusNew word analogy corpus
New word analogy corpus
 
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSTEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
[Emnlp] what is glo ve part iii - towards data science
[Emnlp] what is glo ve  part iii - towards data science[Emnlp] what is glo ve  part iii - towards data science
[Emnlp] what is glo ve part iii - towards data science
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword units
 
Doc format.
Doc format.Doc format.
Doc format.
 
EasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfEasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdf
 
Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...
 
FinalReport
FinalReportFinalReport
FinalReport
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data science
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
Techniques for automatically correcting words in text
Techniques for automatically correcting words in textTechniques for automatically correcting words in text
Techniques for automatically correcting words in text
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
 
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORKSENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
 
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural NetworkSentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
 
A neural probabilistic language model
A neural probabilistic language modelA neural probabilistic language model
A neural probabilistic language model
 

More from Vincenzo Lomonaco

2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdfVincenzo Lomonaco
 
Continual Learning with Deep Architectures - Tutorial ICML 2021
Continual Learning with Deep Architectures - Tutorial ICML 2021Continual Learning with Deep Architectures - Tutorial ICML 2021
Continual Learning with Deep Architectures - Tutorial ICML 2021Vincenzo Lomonaco
 
Toward Continual Learning on the Edge
Toward Continual Learning on the EdgeToward Continual Learning on the Edge
Toward Continual Learning on the EdgeVincenzo Lomonaco
 
Continual Learning: Another Step Towards Truly Intelligent Machines
Continual Learning: Another Step Towards Truly Intelligent MachinesContinual Learning: Another Step Towards Truly Intelligent Machines
Continual Learning: Another Step Towards Truly Intelligent MachinesVincenzo Lomonaco
 
Continual Reinforcement Learning in 3D Non-stationary Environments
Continual Reinforcement Learning in 3D Non-stationary EnvironmentsContinual Reinforcement Learning in 3D Non-stationary Environments
Continual Reinforcement Learning in 3D Non-stationary EnvironmentsVincenzo Lomonaco
 
Continual/Lifelong Learning with Deep Architectures
Continual/Lifelong Learning with Deep ArchitecturesContinual/Lifelong Learning with Deep Architectures
Continual/Lifelong Learning with Deep ArchitecturesVincenzo Lomonaco
 
Continual Learning for Robotics
Continual Learning for RoboticsContinual Learning for Robotics
Continual Learning for RoboticsVincenzo Lomonaco
 
Don't forget, there is more than forgetting: new metrics for Continual Learni...
Don't forget, there is more than forgetting: new metrics for Continual Learni...Don't forget, there is more than forgetting: new metrics for Continual Learni...
Don't forget, there is more than forgetting: new metrics for Continual Learni...Vincenzo Lomonaco
 
Open-Source Frameworks for Deep Learning: an Overview
Open-Source Frameworks for Deep Learning: an OverviewOpen-Source Frameworks for Deep Learning: an Overview
Open-Source Frameworks for Deep Learning: an OverviewVincenzo Lomonaco
 
Continual Learning with Deep Architectures Workshop @ Computer VISIONers Conf...
Continual Learning with Deep Architectures Workshop @ Computer VISIONers Conf...Continual Learning with Deep Architectures Workshop @ Computer VISIONers Conf...
Continual Learning with Deep Architectures Workshop @ Computer VISIONers Conf...Vincenzo Lomonaco
 
CORe50: a New Dataset and Benchmark for Continual Learning and Object Recogni...
CORe50: a New Dataset and Benchmark for Continual Learning and Object Recogni...CORe50: a New Dataset and Benchmark for Continual Learning and Object Recogni...
CORe50: a New Dataset and Benchmark for Continual Learning and Object Recogni...Vincenzo Lomonaco
 
Continuous Learning with Deep Architectures
Continuous Learning with Deep ArchitecturesContinuous Learning with Deep Architectures
Continuous Learning with Deep ArchitecturesVincenzo Lomonaco
 
CORe50: a New Dataset and Benchmark for Continuous Object Recognition Poster
CORe50: a New Dataset and Benchmark for Continuous Object Recognition PosterCORe50: a New Dataset and Benchmark for Continuous Object Recognition Poster
CORe50: a New Dataset and Benchmark for Continuous Object Recognition PosterVincenzo Lomonaco
 
Continuous Unsupervised Training of Deep Architectures
Continuous Unsupervised Training of Deep ArchitecturesContinuous Unsupervised Training of Deep Architectures
Continuous Unsupervised Training of Deep ArchitecturesVincenzo Lomonaco
 
Comparing Incremental Learning Strategies for Convolutional Neural Networks
Comparing Incremental Learning Strategies for Convolutional Neural NetworksComparing Incremental Learning Strategies for Convolutional Neural Networks
Comparing Incremental Learning Strategies for Convolutional Neural NetworksVincenzo Lomonaco
 
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Vincenzo Lomonaco
 
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Vincenzo Lomonaco
 
A Framework for Deadlock Detection in Java
A Framework for Deadlock Detection in JavaA Framework for Deadlock Detection in Java
A Framework for Deadlock Detection in JavaVincenzo Lomonaco
 
Deep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoDeep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoVincenzo Lomonaco
 

More from Vincenzo Lomonaco (20)

2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
2023-08-22 CoLLAs Tutorial - Beyond CIL.pdf
 
Continual Learning with Deep Architectures - Tutorial ICML 2021
Continual Learning with Deep Architectures - Tutorial ICML 2021Continual Learning with Deep Architectures - Tutorial ICML 2021
Continual Learning with Deep Architectures - Tutorial ICML 2021
 
Toward Continual Learning on the Edge
Toward Continual Learning on the EdgeToward Continual Learning on the Edge
Toward Continual Learning on the Edge
 
Continual Learning: Another Step Towards Truly Intelligent Machines
Continual Learning: Another Step Towards Truly Intelligent MachinesContinual Learning: Another Step Towards Truly Intelligent Machines
Continual Learning: Another Step Towards Truly Intelligent Machines
 
Tutorial inns2019 full
Tutorial inns2019 fullTutorial inns2019 full
Tutorial inns2019 full
 
Continual Reinforcement Learning in 3D Non-stationary Environments
Continual Reinforcement Learning in 3D Non-stationary EnvironmentsContinual Reinforcement Learning in 3D Non-stationary Environments
Continual Reinforcement Learning in 3D Non-stationary Environments
 
Continual/Lifelong Learning with Deep Architectures
Continual/Lifelong Learning with Deep ArchitecturesContinual/Lifelong Learning with Deep Architectures
Continual/Lifelong Learning with Deep Architectures
 
Continual Learning for Robotics
Continual Learning for RoboticsContinual Learning for Robotics
Continual Learning for Robotics
 
Don't forget, there is more than forgetting: new metrics for Continual Learni...
Don't forget, there is more than forgetting: new metrics for Continual Learni...Don't forget, there is more than forgetting: new metrics for Continual Learni...
Don't forget, there is more than forgetting: new metrics for Continual Learni...
 
Open-Source Frameworks for Deep Learning: an Overview
Open-Source Frameworks for Deep Learning: an OverviewOpen-Source Frameworks for Deep Learning: an Overview
Open-Source Frameworks for Deep Learning: an Overview
 
Continual Learning with Deep Architectures Workshop @ Computer VISIONers Conf...
Continual Learning with Deep Architectures Workshop @ Computer VISIONers Conf...Continual Learning with Deep Architectures Workshop @ Computer VISIONers Conf...
Continual Learning with Deep Architectures Workshop @ Computer VISIONers Conf...
 
CORe50: a New Dataset and Benchmark for Continual Learning and Object Recogni...
CORe50: a New Dataset and Benchmark for Continual Learning and Object Recogni...CORe50: a New Dataset and Benchmark for Continual Learning and Object Recogni...
CORe50: a New Dataset and Benchmark for Continual Learning and Object Recogni...
 
Continuous Learning with Deep Architectures
Continuous Learning with Deep ArchitecturesContinuous Learning with Deep Architectures
Continuous Learning with Deep Architectures
 
CORe50: a New Dataset and Benchmark for Continuous Object Recognition Poster
CORe50: a New Dataset and Benchmark for Continuous Object Recognition PosterCORe50: a New Dataset and Benchmark for Continuous Object Recognition Poster
CORe50: a New Dataset and Benchmark for Continuous Object Recognition Poster
 
Continuous Unsupervised Training of Deep Architectures
Continuous Unsupervised Training of Deep ArchitecturesContinuous Unsupervised Training of Deep Architectures
Continuous Unsupervised Training of Deep Architectures
 
Comparing Incremental Learning Strategies for Convolutional Neural Networks
Comparing Incremental Learning Strategies for Convolutional Neural NetworksComparing Incremental Learning Strategies for Convolutional Neural Networks
Comparing Incremental Learning Strategies for Convolutional Neural Networks
 
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
 
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
 
A Framework for Deadlock Detection in Java
A Framework for Deadlock Detection in JavaA Framework for Deadlock Detection in Java
A Framework for Deadlock Detection in Java
 
Deep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoDeep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with Theano
 

Word2vec models on Italian language experiments

Word2vec on the Italian language: first experiments
Vincenzo Lomonaco (Alma Mater Studiorum - University of Bologna)
February 19, 2015
Abstract
The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent years. The vector representations of words learned by word2vec models have been shown to carry semantic meaning and are useful in various NLP tasks. In this work I try to reproduce the previously obtained results for the English language and to explore the possibility of doing the same for the Italian language.
1 Introduction
Many current NLP systems and techniques treat words as atomic units: there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons: simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling. However, the simple techniques are at their limits in many tasks. With the progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and they now typically outperform the simple models. Probably one of the most successful concepts is the use of distributed representations of words [2]. For example, neural network based language models significantly outperform N-gram models in many cases [1], [8], [4]. The word2vec tool was born out of this trend. It can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as I know, none of the previously proposed architectures has been successfully trained on more than a few hundred million words, with a modest dimensionality of the word vectors between 50 and 100.
The main goal of this work is to validate previously proposed experiments for the English language (especially exploring how this tool performs on smaller data sets) and then to figure out whether it is possible to reproduce the same accuracy and performance with the Italian language. In section 2, the architectures proposed by word2vec are briefly summarized. In section 3, I present the corpora, the preprocessing and the test sets used. Then, in section 4, I explain in detail which experiments were performed and the results obtained. Lastly, in section 5, I draw the main conclusions.
2 Word2vec models
Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Word2vec computes distributed representations of words using neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words [6], [9] and they are computationally cheaper than LDA on large data sets. Practically speaking, word2vec proposes two new model architectures for learning distributed representations of words that try to minimize computational complexity. The first one is called Continuous Bag-of-Words (CBOW) and is quite similar to the feedforward
Neural Net Language Model (NNLM), where the non-linear hidden layer is removed and the projection layer is shared for all words. This architecture is called Continuous Bag-of-Words as the order of words in the history does not influence the projection. Furthermore, words from the future are used. The best performance in the original work was obtained on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence; it is called Continuous Skip-gram (Skip-gram). More precisely, each current word is used as an input to a log-linear classifier with a continuous projection layer to predict words within a certain range before and after the current word. It was found that increasing the range improves the quality of the resulting word vectors, but it also increases the computational complexity.
Figure 1: New model architectures. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word.
3 Corpora and test sets
Due to computational and memory limits, I was forced to consider only small data sets and to compute word vectors on them. In order to underline the correlation between data set size and word vector quality, I decided to prepare two data sets for each language: one of 100MB and one of 200MB. For the English language I chose to use a 200MB chunk of the “One Billion Word Language Modeling Benchmark”, a tokenized corpus provided by Google. I further considered the small, sampled and already preprocessed version of the Wikipedia dump corpus (text8) that ships with word2vec as a demo data set, in order to make some comparisons.
For the Italian language I chose to sample the plain-text ItWaC corpus directly (it counts more than 2 billion words) and reduce it to our demo size of 200MB. Table 1 provides further information about each data set.
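To make the difference between the two architectures summarized in section 2 more concrete, the following minimal Python sketch shows how training examples could be formed from a tokenized sentence. The function names `skipgram_pairs` and `cbow_examples`, the fixed symmetric window and the example sentence are illustrative assumptions; the actual word2vec C code differs in several details (for instance, it samples the effective window size at random).

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs: the center word predicts each neighbor."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_examples(tokens, window):
    """Yield (context_words, center) examples: the context predicts the center."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # History and future words are merged; their order is not used by CBOW.
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            yield context, center


# Hypothetical example sentence, used only for illustration.
sentence = "il gatto dorme sul divano".split()
print(list(skipgram_pairs(sentence, window=1)))
print(list(cbow_examples(sentence, window=1)))
```

A larger `window` produces more pairs per sentence, which is consistent with the remark that increasing the range raises the computational cost.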
Table 1: Data sets summary
Lang  Name    Size   Vocab size  Number of words  Encoding
Eng   1BWLMB  100MB  60745       18037497         utf-8
Eng   text8   100MB  71291       16718843         utf-8
Eng   1BWLMB  200MB  81746       34756679         utf-8
Ita   ItWaC   100MB  90486       16691286         utf-8
Ita   ItWaC   200MB  125625      33394879         utf-8
Before submitting the text directly to word2vec we need some preprocessing. In fact, word2vec gives its best when:
• Punctuation and special characters are removed
• Words are converted to lowercase
• Numerals are converted to their word forms (e.g. 1996 becomes one nine nine six)
Even if these preprocessing steps are not strictly necessary, they can improve the accuracy and be useful for some kinds of applications. In this work I decided to remove only punctuation and special characters. Thus, I wrote some scripts in Python in order to preprocess all the data in the same way.
With respect to the test sets, the word2vec team provided a specific set of “questions” to evaluate the accuracy of the word vectors. Although it is easy to show that the word France is similar to Italy and perhaps some other countries, it is much more challenging to subject those vectors to a more complex similarity task. The authors follow previous observations that there can be many different types of similarities between words: for example, the word big is similar to bigger in the same sense that small is similar to smaller. On these premises they denote two pairs of words with the same relationship as a question, as we can ask: “What is the word that is similar to small in the same sense as biggest is similar to big?” Somewhat surprisingly, these questions can be answered by performing simple algebraic operations on the vector representations of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute the vector X = vector(“biggest”) - vector(“big”) + vector(“small”).
Then, the word closest to X, as measured by cosine distance, can be searched for in the vector space and used as the answer to the question.
Thus, to measure the quality of the word vectors, a comprehensive test set that contains five types of semantic questions and nine types of syntactic questions was defined. Two examples from each category are shown in Figure 2. Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each category were created in two steps: first, a list of similar word pairs was created manually. Then, a large list of questions was formed by connecting two word pairs. For example, the authors made a list of 68 large American cities and the states they belong to, and formed about 2.5K questions by picking two word pairs at random. Only single-token words are included, thus multi-word entities (such as New York) are not present. The overall accuracy is then evaluated for all question types, and for each question type separately (semantic, syntactic). A question is assumed to be correctly answered only if the closest word to the vector computed using the above method is exactly the same as the correct word in the question; synonyms are thus counted as mistakes. This also means that reaching 100% accuracy is likely to be impossible, as the current models do not have any input information about word morphology.
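The vector arithmetic used to answer a question can be illustrated with a short NumPy sketch. The three-dimensional toy vectors and the function name `answer_analogy` are assumptions made up for this example and do not come from a trained model; the sketch also works on raw, unnormalized input vectors, which is a simplification.

```python
import numpy as np


def answer_analogy(vectors, a, b, c):
    """Return the word closest (by cosine similarity) to
    X = vector(b) - vector(a) + vector(c), excluding a, b and c."""
    x = vectors[b] - vectors[a] + vectors[c]
    x = x / np.linalg.norm(x)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the question words themselves are not valid answers
        sim = float(np.dot(x, vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Hypothetical toy vectors: axis 0 ~ "bigness", axis 1 ~ "superlative",
# axis 2 ~ "smallness". Real word2vec vectors are learned, not hand-built.
toy = {
    "big": np.array([1.0, 0.0, 0.0]),
    "biggest": np.array([1.0, 1.0, 0.0]),
    "small": np.array([0.0, 0.0, 1.0]),
    "smallest": np.array([0.0, 1.0, 1.0]),
    "smaller": np.array([0.0, 0.3, 1.0]),
}

# "What is the word that is similar to small in the same sense
# as biggest is similar to big?"
print(answer_analogy(toy, "big", "biggest", "small"))
```

Maximizing cosine similarity is equivalent to minimizing cosine distance, so the returned word is the one the text refers to as closest to X.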
Figure 2: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic Word Relationship test set for the English language.
The same test set has been carried over to this work by manually translating the original word pairs into Italian and then automatically forming questions by connecting two word pairs in all their possible combinations. However, it is worth saying that switching from one language to another can sometimes lead to meaningless questions. Some words, in fact, are more common than others or cannot be translated into a single word in the other language. All possible efforts have been made to maintain the correspondence while preserving the meaning of the test, and in the following sections we will try to assess how well this approach works.
4 Experiments and results
In this section we present two main experiments performed in parallel on the English and Italian sampled corpora. The first one is the accuracy measure computed with the test sets on different data sets, and the second one is an exploratory analysis of an unsupervised word clustering attempt.
4.1 Test sets experiments
All the experiments were performed using the Skip-gram model and a vector size of 200. For the other options the default word2vec values were chosen. In Tab. 2 we can see the results directly reported by the original compute-accuracy code for the English language on the demo text8 data set. There are about 19K possible questions, but only about 60% of them were used. An average accuracy of 26% is reached with the TOP-1 metric, meaning that the correct answer is exactly the first ranked among all the guesses. This is an acceptable result given that all synonyms are considered wrong and the data set is very small.
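The TOP-1 metric and the "questions used / total" ratio can be sketched as follows. This is a minimal illustration, not the original compute-accuracy code: the function name `top1_accuracy`, the toy vectors and the toy questions are assumptions, and the sketch simply skips any question containing a word that is missing from the vocabulary.

```python
import numpy as np


def top1_accuracy(vectors, questions):
    """Return (correct, used). A question (a, b, c, expected) is used only if
    all four words are in the vocabulary; it is correct only if the nearest
    word to vector(b) - vector(a) + vector(c) is exactly `expected`."""
    words = list(vectors)
    matrix = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    correct = used = 0
    for a, b, c, expected in questions:
        if any(w not in vectors for w in (a, b, c, expected)):
            continue  # out-of-vocabulary question: not counted as used
        used += 1
        x = vectors[b] - vectors[a] + vectors[c]
        sims = matrix @ (x / np.linalg.norm(x))
        for w in (a, b, c):
            sims[words.index(w)] = -np.inf  # exclude the question words
        if words[int(np.argmax(sims))] == expected:
            correct += 1
    return correct, used


# Hypothetical toy vocabulary and questions, for illustration only.
toy = {
    "big": np.array([1.0, 0.0, 0.0]),
    "biggest": np.array([1.0, 1.0, 0.0]),
    "small": np.array([0.0, 0.0, 1.0]),
    "smallest": np.array([0.0, 1.0, 1.0]),
    "smaller": np.array([0.0, 0.3, 1.0]),
}
questions = [
    ("big", "biggest", "small", "smallest"),
    ("small", "smallest", "big", "biggest"),
    ("big", "biggest", "small", "smaller"),  # a near miss counts as a mistake
    ("big", "biggest", "tall", "tallest"),   # "tall" is out of vocabulary
]

correct, used = top1_accuracy(toy, questions)
print(f"accuracy: {correct}/{used}, questions used: {used}/{len(questions)}")
```

Skipping out-of-vocabulary questions is why the number of questions used can be lower than the total, and why it can differ between data sets with different vocabularies.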
Table 2: English text8 accuracy
Question type                  110MB-TOP1
capital-common-countries       37.35% (189 / 506)
capital-world                  23.35% (339 / 1452)
currency                       6.34% (17 / 268)
city-in-state                  19.22% (302 / 1571)
family                         52.94% (162 / 306)
gram1-adjective-to-adverb      5.16% (39 / 756)
gram2-opposite                 13.07% (40 / 306)
gram3-comparative              31.43% (396 / 1260)
gram4-superlative              12.45% (63 / 506)
gram5-present-participle       14.11% (140 / 992)
gram6-nationality-adjective    56.24% (771 / 1371)
gram7-past-tense               18.17% (242 / 1332)
gram8-plural                   40.12% (398 / 992)
gram9-plural-verbs             17.38% (113 / 650)
Average:                       26.17% (3211 / 12268)
Questions used / total:        62.77% (12268 / 19544)
Let us now consider the results for the “One Billion Word Language Modeling Benchmark” in Tab. 3. According to the table, results are slightly worse on this data set, reaching an average accuracy of 19.77% on the 100MB sample. With the 200MB sample, however, accuracy jumps to 29.98%. This is a large improvement considering that only another 100MB chunk of data was added.
Table 3: English 1BWLMB accuracy
Question type                  100MB-TOP1              200MB-TOP1
capital-common-countries       32.81% (166 / 506)      46.64% (236 / 506)
capital-world                  20.37% (355 / 1743)     35.20% (598 / 1699)
currency                       1.56% (2 / 128)         6.48% (7 / 108)
city-in-state                  9.18% (184 / 2005)      13.34% (267 / 2001)
family                         52.11% (198 / 380)      54.21% (206 / 380)
gram1-adjective-to-adverb      1.72% (16 / 930)        3.01% (28 / 930)
gram2-opposite                 5.53% (28 / 506)        6.06% (28 / 462)
gram3-comparative              38.06% (507 / 1332)     52.25% (696 / 1332)
gram4-superlative              11.95% (97 / 812)       20.57% (167 / 812)
gram5-present-participle       18.35% (182 / 992)      28.83% (286 / 992)
gram6-nationality-adjective    30.11% (370 / 1229)     47.76% (554 / 1160)
gram7-past-tense               22.95% (358 / 1560)     35.38% (552 / 1560)
gram8-plural                   16.29% (172 / 1056)     28.25% (317 / 1122)
gram9-plural-verbs             18.46% (120 / 650)      26.15% (170 / 650)
Average:                       19.77% (2735 / 13829)   29.98% (4112 / 13714)
Questions used / total:        70.76% (13829 / 19544)  70.17% (13714 / 19544)
Moving to the Italian language, the accuracy results are given in Tab. 4. First of all, the third grammar section (gram3-comparative) was removed from the test set, since it was impossible to translate comparatives into a single word in Italian. With respect to accuracy, it is possible to see a large drop in basically every test section compared with the corresponding English results. The main reasons for this drop concern the original corpora and the sampled data sets. Let us consider, for example, the city-in-state section, which drops from 19% to nearly 1%. This section is based on questions referring to US states and cities. It is clear that Wikipedia or the Google News corpus contain many more entities of this kind than the ItWaC corpus, which is built by crawling random .it domains and is not as clean as the other two. The same can be said for the other sections, especially the non-syntactic ones. However, even in some sections in which better accuracy was expected (such as the plural section) we see a steep drop. There are many possible explanations for this. First of all, Italian, like other European languages, is morphologically richer than English, implying that more data are needed to reach the same accuracy. Moreover, the words selected for the questions are common in English, and the same cannot necessarily be said of their Italian translations. This phenomenon may imply a large difference in accuracy on small data sets, especially with a neural network model. To have the last word, however, a larger number of experiments has to be done, varying the size of the data sets up to the state-of-the-art data set size and using a more ad hoc benchmark for evaluating the quality of word vectors in Italian. Nevertheless, also in this case, increasing the size of the data set leads to a good improvement in accuracy in almost all the test sections, with an overall accuracy improvement of a few percentage points.
Table 4: Italian ItWaC accuracy
Question type                  100MB-TOP1              200MB-TOP1
capital-common-countries       8.01% (37 / 462)        15.37% (71 / 462)
capital-world                  4.56% (32 / 702)        8.51% (74 / 870)
currency                       0.00% (0 / 156)         0.00% (0 / 182)
city-in-state                  1.19% (9 / 756)         3.63% (36 / 992)
family                         6.76% (25 / 370)        9.90% (49 / 495)
gram1-adjective-to-adverb      1.03% (9 / 870)         3.68% (32 / 870)
gram2-opposite                 0.15% (1 / 650)         2.85% (20 / 702)
gram4-superlative              0.71% (5 / 702)         3.98% (37 / 930)
gram5-present-participle       6.99% (65 / 930)        12.90% (128 / 992)
gram6-nationality-adjective    2.06% (26 / 1260)       5.00% (78 / 1560)
gram7-past-tense               1.25% (14 / 1122)       3.68% (49 / 1332)
gram8-plural                   3.49% (44 / 1260)       5.16% (65 / 1260)
gram9-plural-verbs             16.32% (142 / 870)      28.85% (251 / 870)
Average:                       4.12% (417 / 10110)     7.73% (890 / 11517)
Questions used / total:        71.08% (10110 / 14223)  80.97% (11517 / 14223)
4.2 Clustering experiments
The word vectors can also be used for deriving word classes from huge data sets. This is achieved by performing K-means clustering on top of the word vectors. The original word2vec release contains a plain C implementation and a shell script that demonstrates its use. The output is a vocabulary file with words and their corresponding class IDs. Some examples for both languages are provided in Tab. 5 and Tab. 6 (note that there is no correlation between words of different languages: they are put together only for formatting purposes). Even if the user can set a specific number of output classes, semantic and syntactic similarities are difficult
  • 7. to separate. This could be good for some applications and less good for others. From a brief exploratory analysis, the quality of the classes seems to be similar for both languages, but only a stricter test with the help of a wordnet could validate this claim.

Table 5: Clustering Examples

English        Class    Italian        Class
carnivores     234      bancario       10
carnivorous    234      bonifico       10
cetaceans      234      cambiale       10
cormorant      234      cambiali       10
coyotes        234      cambiari       10
crocodile      234      correntista    10
crocodiles     234      costitutore    10
crustaceans    234      credito        10
cultivated     234      debitore       10
danios         234      denaro         10

Table 6: Clustering Examples

English        Class    Italian           Class
acceptance     412      menzogneri        341
argue          412      minacciare        341
argues         412      minando           341
arguing        412      minato            341
argument       412      mistificazione    341
arguments      412      nefasta           341
belief         412      opponendo         341
believe        412      opponendosi       341
challenge      412      oppressa          341
claim          412      oppressore        341

5 Conclusion

In this work we have tried to understand word2vec, a well-known tool for learning high-quality word vectors, and to reproduce to some extent the results obtained in the original work for the English language. Moreover, we aimed to start bringing the same experiments to the Italian language and to see how they behave. Using very different corpora of limited size, and translating the test set directly without an accurate linguistic revision, led to a very low accuracy level for the Italian language. Increasing the number of words up to billions, together with the size of the vocabulary, could certainly raise the overall accuracy. The next step, however, would be to construct a new test set from scratch and use it as a benchmark for trying the most widely used word space models on the Italian language. In the end, I am confident that, with appropriate effort, it would be possible to use word2vec and its various parallel implementations the same way as they are used for English, reaching the state of the art in terms of both performance and accuracy.
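The K-means step described in Section 4.2 can be sketched as follows. This is a simplified illustration, not the C implementation shipped with word2vec: it uses squared Euclidean distance and a deterministic initialization from the first k vectors, both of which are assumptions made here for brevity. The function name `kmeans` and the toy two-dimensional vectors are hypothetical.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=10):
    """Assign each point a class ID in range(k) using plain K-means."""
    # Assumption: deterministic initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: sqdist(p, centroids[c]))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy 2-D "word vectors" (illustrative values, not learned from ItWaC).
words = ["credito", "minacciare", "debitore", "denaro", "oppressore"]
points = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.8, 0.0], [0.0, 0.9]]

classes = dict(zip(words, kmeans(points, k=2)))
for word, class_id in classes.items():
    print(word, class_id)  # one "word class_id" pair per line
```

The printed pairs mirror the layout of the vocabulary file with class IDs mentioned in Section 4.2, although the class numbers themselves are arbitrary labels.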
  • 8. References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[2] James L McClelland, David E Rumelhart, PDP Research Group, et al. Parallel distributed processing. Explorations in the microstructure of cognition, 2:216–271, 1986.
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Černocký. Empirical evaluation and combination of advanced language modeling techniques. In INTERSPEECH, number s 1, pages 605–608, 2011.
[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[6] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
[7] Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.
[8] Holger Schwenk. Continuous space language models. Computer Speech & Language, 21(3):492–518, 2007.
[9] Alisa Zhila, Wen-tau Yih, Christopher Meek, Geoffrey Zweig, and Tomas Mikolov. Combining heterogeneous models for measuring relational similarity. In HLT-NAACL, pages 1000–1009, 2013.