Word2vec models on Italian language experiments

Word2vec on the Italian language: first experiments
Vincenzo Lomonaco
Alma Mater Studiorum - University of Bologna
February 19, 2015
Abstract
The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent
years. The vector representations of words learned by word2vec models have been shown to
carry semantic meaning and are useful in various NLP tasks. In this work I try to reproduce the
previously obtained results for the English language and to explore the possibility of doing the same for
the Italian language.
1 Introduction
Many current NLP systems and techniques treat words as atomic units: there is no notion of similarity
between words, as these are represented as indices in a vocabulary. This choice has several good reasons:
simplicity, robustness and the observation that simple models trained on huge amounts of data outperform
complex systems trained on less data. An example is the popular N-gram model used for statistical language
modeling. However, these simple techniques are at their limits in many tasks. With the progress of machine
learning techniques in recent years, it has become possible to train more complex models on much larger
data sets, and they now typically outperform the simple models. Probably one of the most successful
concepts is the use of distributed representations of words [2]. For example, neural network based language
models significantly outperform N-gram models in many cases [1, 8, 4]. The word2vec tool was born out
of this trend. It can be used for learning high-quality word vectors from huge data sets with billions of
words, and with millions of words in the vocabulary. As far as I know, none of the previously proposed
architectures has been successfully trained on more than a few hundred million words, with a modest
dimensionality of the word vectors between 50 and 100. The main goal of this work is to validate previously
proposed experiments for the English language (especially exploring how this tool performs on smaller data
sets) and then to find out whether it is possible to reproduce the same accuracy and performance with
the Italian language. In section 2, the architectures proposed by word2vec are briefly summarized. In section 3, I
present the corpora, the preprocessing and the test sets used. Then, in section 4, I explain in detail which
experiments were performed and the results obtained. Lastly, in section 5, I draw the main conclusions.
2 Word2vec models
Many different types of models were proposed for estimating continuous representations of words, including
the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Word2vec computes
distributed representations of words using neural networks, as it was previously shown that they perform
significantly better than LSA for preserving linear regularities among words [6, 9] and that they are
computationally cheaper than LDA on large data sets. Practically speaking, word2vec proposes two new model
architectures for learning distributed representations of words that try to minimize computational
complexity. The first one is called Continuous Bag-of-Words (CBOW) and is quite similar to the feedforward
Neural Net Language Model (NNLM), where the non-linear hidden layer is removed and the projection layer
is shared for all words. This architecture is called Continuous Bag-of-Words because the order of words in the
history does not influence the projection. Furthermore, words from the future are also used. The best performance
in the original work was obtained on the task introduced in the next section by building a log-linear classifier
with four future and four history words at the input, where the training criterion is to correctly classify the
current (middle) word. The second architecture, called Continuous Skip-gram (Skip-gram), is similar to CBOW,
but instead of predicting the current word based on the context, it tries to maximize classification of a word
based on another word in the same sentence. More precisely, each current word is used
as an input to a log-linear classifier with a continuous projection layer to predict words within a certain range
before and after the current word. Increasing the range was found to improve the quality of the resulting
word vectors, but it also increases the computational complexity.
Figure 1: New model architectures. The CBOW architecture predicts the current word based on the context,
and the Skip-gram predicts surrounding words given the current word.
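To make the difference between the two architectures more concrete, the following minimal Python sketch builds the training examples each of them would consume from a tokenized sentence. The function names, the fixed symmetric window and the example sentence are assumptions made for illustration; they are not taken from the word2vec source, and the actual tool adds further details that are omitted here.

```python
def cbow_examples(tokens, window):
    """CBOW view: the surrounding context words predict the middle word."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window):
    """Skip-gram view: the current word predicts each neighbour in range."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                examples.append((center, tokens[j]))
    return examples


# Hypothetical example sentence, already tokenized.
sentence = "il gatto dorme sul divano".split()
cbow = cbow_examples(sentence, window=2)
skip = skipgram_examples(sentence, window=2)

print(cbow[2])   # (['il', 'gatto', 'sul', 'divano'], 'dorme')
print(skip[:3])  # [('il', 'gatto'), ('il', 'dorme'), ('gatto', 'il')]
```

With the same window, Skip-gram produces one example per (word, neighbour) couple, while CBOW produces one example per position, which gives an intuition of why a larger range raises the computational cost.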
3 Corpora and test sets
Due to computational and memory limits, I was forced to consider only small data sets and to compute
word vectors on them. In order to highlight the correlation between data set size and word
vector quality, I decided to prepare two data sets for each language: one of 100MB and one
of 200MB. For the English language I chose a 200MB chunk of the “One Billion Word Language
Modeling Benchmark”, a tokenized corpus provided by Google. I further considered the small, sampled
and already preprocessed version of the Wikipedia dump corpus (text8) that ships with word2vec as a demo
data set, in order to make some comparisons. For the Italian language I chose to sample the plain-text
ItWaC corpus directly (it counts more than 2 billion words) and reduce it to our demo size of 200MB. Table 1
provides further information about each data set.
Table 1: Data sets summary
Lang  Name    Size   Vocabulary size  Number of words  Encoding
Eng   1BWLMB  100MB  60745            18037497         utf-8
Eng   text8   100MB  71291            16718843         utf-8
Eng   1BWLMB  200MB  81746            34756679         utf-8
Ita   ItWaC   100MB  90486            16691286         utf-8
Ita   ItWaC   200MB  125625           33394879         utf-8
Before feeding the text to word2vec, some preprocessing is needed. In fact, word2vec performs
best when:
• Punctuation and special characters are removed
• Words are converted to lowercase
• Numerals are converted to their word forms (e.g. 1996 becomes one nine nine six)
Even if these preprocessing steps are not strictly necessary, they can improve the accuracy and be useful for
some kinds of applications. In this work I decided to remove only punctuation and special characters. Thus,
I wrote some scripts in Python in order to preprocess all the data in the same way.
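The original scripts are not reproduced here. As a rough idea of what such a step may look like, the sketch below removes punctuation and special characters while leaving case and numerals untouched, as described above. The regular expressions and the function name are assumptions of this sketch, not the code actually used in the experiments.

```python
import re

# Anything that is neither a word character nor whitespace is treated as
# punctuation or a special character. "\w" also matches "_", so it is
# listed explicitly. Accented letters are word characters and are kept.
_NON_WORD = re.compile(r"[^\w\s]|_")
_SPACES = re.compile(r"\s+")


def strip_punctuation(line):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    cleaned = _NON_WORD.sub(" ", line)
    return _SPACES.sub(" ", cleaned).strip()


print(strip_punctuation("Perché no? L'Italia, nel 1996, vinse!"))
# Perché no L Italia nel 1996 vinse
```

Replacing punctuation with a space rather than deleting it keeps elided forms such as L'Italia as two tokens; whether that is desirable depends on the application.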
With respect to the test sets, the word2vec team provided a specific set of “questions” to evaluate the
accuracy of the word vectors. Although it is easy to show that the word France is similar to Italy and perhaps
some other countries, it is much more challenging to subject those vectors to a more complex similarity
task. The authors follow previous observations that there can be many different types of similarities between
words; for example, the word big is similar to bigger in the same sense that small is similar to smaller. On these
premises they treat two pairs of words with the same relationship as a question, as we can ask: “What
is the word that is similar to small in the same sense as biggest is similar to big?” Somewhat surprisingly,
these questions can be answered by performing simple algebraic operations on the vector representations
of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can
simply compute the vector X = vector(“biggest”) - vector(“big”) + vector(“small”). Then, the vector space is
searched for the word closest to X, as measured by cosine distance, and that word is used as the answer to the question.
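The vector arithmetic just described can be sketched in a few lines of Python. The two-dimensional vectors below are invented toy values chosen so that the example works out by hand; real word2vec vectors have hundreds of dimensions. Excluding the three question words from the candidate answers is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_analogy(vectors, a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c)."""
    x = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption: the question words themselves are not valid answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(x, vectors[w]))


# Toy vectors (hypothetical values): first axis ~ size, second ~ degree.
toy = {
    "big": [1.0, 0.0],
    "biggest": [1.0, 1.0],
    "small": [-1.0, 0.0],
    "smallest": [-1.0, 1.0],
    "smaller": [-1.0, 0.5],
}

print(answer_analogy(toy, "big", "biggest", "small"))  # smallest
```

Here X = (1, 1) - (1, 0) + (-1, 0) = (-1, 1), which coincides with the toy vector for smallest, so its cosine similarity with X is 1 and it ranks above smaller.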
Thus, to measure the quality of the word vectors, a comprehensive test set containing five types of
semantic questions and nine types of syntactic questions was defined. Two examples from each category are
shown in Figure 2. Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each
category were created in two steps: first, a list of similar word pairs was created manually. Then, a large
list of questions was formed by connecting two word pairs. For example, the authors made a list of 68 large
American cities and the states they belong to, and formed about 2.5K questions by picking two word pairs
at random. Only single-token words are included, thus multi-word entities (such as New York) are not
present. The accuracy is then evaluated overall, across all question types, and for each question type separately
(semantic, syntactic). A question is assumed to be correctly answered only if the closest word to the vector
computed using the above method is exactly the same as the correct word in the question; synonyms are
thus counted as mistakes. This also means that reaching 100% accuracy is likely to be impossible, as the
current models do not have any input information about word morphology.
Figure 2: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic
Word Relationship test set for the English language.
The same test set has been carried over to this work by manually translating the original word pairs into
Italian and then automatically forming questions by connecting two word pairs in all their possible
combinations. However, it is worth saying that switching from one language to another can sometimes lead
to meaningless questions. Some words, in fact, are more common than others or cannot be translated into a
single word in the other language. All possible efforts have been made to maintain the correspondence while
preserving the meaning of the test, and in the following sections we will try to assess the soundness of this
approach.
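The automatic step of forming questions from translated word pairs can be sketched as follows. This is a minimal illustration and not the script used in this work: the function name and the example pairs are hypothetical, and "all possible combinations" is interpreted here as every ordered couple of distinct pairs, which yields n * (n - 1) questions for n pairs (an interpretation that would be consistent with section sizes such as 506 = 23 * 22).

```python
from itertools import permutations


def build_questions(pairs):
    """Connect every ordered couple of distinct word pairs into a question."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]


# Hypothetical translated pairs for a capital/country section.
pairs = [("roma", "italia"), ("parigi", "francia"), ("berlino", "germania")]
questions = build_questions(pairs)

for q in questions[:2]:
    print(" ".join(q))
# roma italia parigi francia
# roma italia berlino germania

print(len(questions))  # 6
```

Each output line has the four-word layout of a test question: the first three words are the input, and the fourth is the expected answer.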
4 Experiments and results
In this section we present two main experiments performed in parallel on the English and Italian sampled
corpora. The first one is the accuracy measure computed with the test sets on different data sets and the
second one is an exploratory analysis of an unsupervised word clustering attempt.
4.1 Test sets experiments
All the experiments were performed using the Skip-gram model and a vector size of 200. For the other
options the default word2vec values were chosen. Tab. 2 shows the results, reported directly from the
original compute-accuracy code, for the English language on the demo text8 data set. The total number of
questions is about 19.5K, but only about 63% of them were used. An average accuracy of 26% is reached
with the TOP-1 metric, meaning that the correct answer must be exactly the first ranked among all the guesses. This is
an acceptable result given that all synonyms are considered wrong and the data set is very small.
Table 2: English text8 accuracy
Specific 110MB-TOP1
capital-common-countries 37.35% (189 / 506)
capital-world 23.35% (339 / 1452)
currency 6.34% (17 / 268)
city-in-state 19.22% (302 / 1571)
family 52.94% (162 / 306)
gram1-adjective-to-adverb 5.16% (39 / 756)
gram2-opposite 13.07% (40 / 306)
gram3-comparative 31.43% (396 / 1260)
gram4-superlative 12.45% (63 / 506)
gram5-present-participle 14.11% (140 / 992)
gram6-nationality-adjective 56.24% (771 / 1371)
gram7-past-tense 18.17% (242 / 1332)
gram8-plural 40.12% (398 / 992)
gram9-plural-verbs 17.38% (113 / 650)
Average: 26.17% (3211/12268)
Questions used / total: 62.77% (12268/19544)
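As a quick check of how the last two rows of Table 2 relate to the per-section rows, the sketch below recomputes them from the counts in the table. It assumes that the average is the pooled ratio of correct answers over used questions (rather than the mean of the per-section percentages); with that assumption the figures in the table are reproduced exactly.

```python
# (correct, used) counts copied from Table 2 (English text8, TOP-1).
table2 = {
    "capital-common-countries": (189, 506),
    "capital-world": (339, 1452),
    "currency": (17, 268),
    "city-in-state": (302, 1571),
    "family": (162, 306),
    "gram1-adjective-to-adverb": (39, 756),
    "gram2-opposite": (40, 306),
    "gram3-comparative": (396, 1260),
    "gram4-superlative": (63, 506),
    "gram5-present-participle": (140, 992),
    "gram6-nationality-adjective": (771, 1371),
    "gram7-past-tense": (242, 1332),
    "gram8-plural": (398, 992),
    "gram9-plural-verbs": (113, 650),
}
TOTAL_QUESTIONS = 19544  # 8869 semantic + 10675 syntactic

correct = sum(c for c, _ in table2.values())
used = sum(n for _, n in table2.values())
accuracy = 100.0 * correct / used
coverage = 100.0 * used / TOTAL_QUESTIONS

print(f"Average: {accuracy:.2f}% ({correct}/{used})")
# Average: 26.17% (3211/12268)
print(f"Questions used / total: {coverage:.2f}% ({used}/{TOTAL_QUESTIONS})")
# Questions used / total: 62.77% (12268/19544)
```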
Let us now consider the results for the “One Billion Word Language Modeling Benchmark” in Tab. 3.
According to the table, results on the 100MB sample of this data set are worse than on text8, reaching an
average accuracy of 19.77%. Doubling the data, however, raises it to 29.98%. This is a huge improvement
considering that only another 100MB chunk of data was added.
Table 3: English 1BWLMB accuracy
Specific 100MB-TOP1 200MB-TOP1
capital-common-countries 32.81% (166 / 506) 46.64% (236 / 506)
capital-world 20.37% (355 / 1743) 35.20% (598 / 1699)
currency 1.56% (2 / 128) 6.48% (7 / 108)
city-in-state 9.18% (184 / 2005) 13.34% (267 / 2001)
family 52.11% (198 / 380) 54.21% (206 / 380)
gram1-adjective-to-adverb 1.72% (16 / 930) 3.01% (28 / 930)
gram2-opposite 5.53% (28 / 506) 6.06% (28 / 462)
gram3-comparative 38.06% (507 / 1332) 52.25% (696 / 1332)
gram4-superlative 11.95% (97 / 812) 20.57% (167 / 812)
gram5-present-participle 18.35% (182 / 992) 28.83% (286 / 992)
gram6-nationality-adjective 30.11% (370 / 1229) 47.76% (554 / 1160)
gram7-past-tense 22.95% (358 / 1560) 35.38% (552 / 1560)
gram8-plural 16.29% (172 / 1056) 28.25% (317 / 1122)
gram9-plural-verbs 18.46% (120 / 650) 26.15% (170 / 650)
Average: 19.77% (2735/13829) 29.98% (4112/13714)
Questions used / total: 70.76% (13829/19544) 70.17% (13714/19544)
Moving to the Italian language, the accuracy results are given in Tab. 4. First of all, the third grammar
section was removed from the test set since it was impossible to translate comparatives into a single word
in Italian. With respect to accuracy, there is a huge drop in basically every test section
compared with the corresponding English results. The main reasons for this drop concern the
original corpora and the sampled data sets. Consider, for example, the city-in-state section, which drops from
19% to nearly 1%. This section is based on questions referring to US states and cities. It is clear that
Wikipedia or the Google News corpus contains many more entities of this kind than the ItWaC
corpus, which is built by crawling random .it domains and is not as clean as the other two. The same can be
said for the other sections, especially the non-syntactic ones. However, even in some sections in which better
accuracy was expected (such as the plural section) we see a steep drop. There are many possible
explanations for this. First of all, Italian, like other European languages, is morphologically
richer than English, implying that more data are needed to reach the same accuracy. Moreover, the words
selected for the questions are common in English, but the same cannot necessarily be said of their
Italian translations. This phenomenon may imply a large difference in accuracy on small data sets, especially
with a neural network model. To reach a firm conclusion, however, a larger number of experiments has to be
carried out, varying the size of the data sets up to state-of-the-art sizes and using a more ad hoc
benchmark for evaluating the quality of word vectors in Italian. Nevertheless, in this case too, increasing the
size of the data set leads to a good improvement in accuracy in almost all the test sections, with an
overall accuracy improvement of a few percentage points.
Table 4: Italian ItWaC accuracy
Specific 100MB-TOP1 200MB-TOP1
capital-common-countries 8.01% (37 / 462) 15.37% (71 / 462)
capital-world 4.56% (32 / 702) 8.51% (74 / 870)
currency 0.00% (0 / 156) 0.00% (0 / 182)
city-in-state 1.19% (9 / 756) 3.63% (36 / 992)
family 6.76% (25 / 370) 9.90% (49 / 495)
gram1-adjective-to-adverb 1.03% (9 / 870) 3.68% (32 / 870)
gram2-opposite 0.15% (1 / 650) 2.85% (20 / 702)
gram4-superlative 0.71% (5 / 702) 3.98% (37 / 930)
gram5-present-participle 6.99% (65 / 930) 12.90% (128 / 992)
gram6-nationality-adjective 2.06% (26 / 1260) 5.00% (78 / 1560)
gram7-past-tense 1.25% (14 / 1122) 3.68% (49 / 1332)
gram8-plural 3.49% (44 / 1260) 5.16% (65 / 1260)
gram9-plural-verbs 16.32% (142 / 870) 28.85% (251 / 870)
Average: 4.12% (417/10110) 7.73% (890/11517)
Questions used / total: 71.08% (10110/14223) 80.97% (11517/14223)
4.2 Clustering experiments
The word vectors can also be used for deriving word classes from huge data sets. This is achieved by
performing K-means clustering on top of the word vectors. The original word2vec release contains a
plain C implementation and a shell script that demonstrates its use. The output is a vocabulary file with
words and their corresponding class IDs. Some examples for both languages are provided in Tab. 5 and
Tab. 6 (note that there is no correlation between words of different languages: they are put together only
for formatting purposes).
Even if the user can set a specific number of output classes, semantic and syntactic similarities are difficult
to separate. This could be good for some applications and less good for others. From a brief exploratory
analysis, the quality of the classes seems to be similar for both languages, but only a stricter test with the help
of a wordnet could validate this claim.
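To illustrate the idea of deriving word classes with K-means, here is a small self-contained Python sketch. It is not the C implementation shipped with word2vec: it uses plain Lloyd-style iterations with squared Euclidean distance, and the two-dimensional vectors are invented toy values attached to a few words borrowed from Tab. 5. All function names and parameters are assumptions of this sketch.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=10, seed=0):
    """Assign each word to one of k classes and return {word: class_id}."""
    words = sorted(vectors)
    rng = random.Random(seed)
    # Initialise the centroids with k distinct word vectors.
    centroids = [list(vectors[w]) for w in rng.sample(words, k)]
    classes = {}
    for _ in range(iterations):
        # Assignment step: each word goes to its nearest centroid.
        for w in words:
            classes[w] = min(
                range(k),
                key=lambda c: squared_distance(vectors[w], centroids[c]),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if classes[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return classes


# Toy vectors (hypothetical values) for a few words from Tab. 5.
toy = {
    "crocodile": [0.9, 0.1], "coyotes": [0.8, 0.2], "cetaceans": [1.0, 0.0],
    "credito": [0.1, 0.9], "denaro": [0.0, 1.0], "debitore": [0.2, 0.8],
}
classes = kmeans(toy, k=2)
for word in sorted(classes):
    print(word, classes[word])
```

The printed lines follow the "word class-ID" layout of the output described above. The class IDs themselves are arbitrary labels; only the grouping is meaningful.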
Table 5: Clustering Examples
English Italian
carnivores 234 bancario 10
carnivorous 234 bonifico 10
cetaceans 234 cambiale 10
cormorant 234 cambiali 10
coyotes 234 cambiari 10
crocodile 234 correntista 10
crocodiles 234 costitutore 10
crustaceans 234 credito 10
cultivated 234 debitore 10
danios 234 denaro 10
Table 6: Clustering Examples
English Italian
acceptance 412 menzogneri 341
argue 412 minacciare 341
argues 412 minando 341
arguing 412 minato 341
argument 412 mistificazione 341
arguments 412 nefasta 341
belief 412 opponendo 341
believe 412 opponendosi 341
challenge 412 oppressa 341
claim 412 oppressore 341
5 Conclusion
In this work we have tried to understand word2vec, a well-known tool for learning high-quality word vectors,
and to reproduce to some extent the results obtained in the original work for the English language. Moreover,
we aimed to start bringing the same experiments to the Italian language. Using very
different corpora of limited size, as well as translating the test set directly without an accurate linguistic
revision, has led to a very low accuracy level for the Italian language. Increasing the number of words up to
billions and the size of the vocabulary could certainly raise the overall level of accuracy. The next step, however,
would be to construct a new test set from scratch and use it as a benchmark for trying all the most used
word space models in the context of the Italian language. In the end, I am confident that, with appropriate
effort, it would be possible to use word2vec and its different parallel implementations in the same way as they
are used for the English language, reaching the state of the art in terms of both performance and accuracy.
References
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language
model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[2] James L McClelland, David E Rumelhart, PDP Research Group, et al. Parallel distributed processing.
Explorations in the microstructure of cognition, 2:216–271, 1986.
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations
in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Černocký. Empirical evaluation
and combination of advanced language modeling techniques. In INTERSPEECH, pages 605–608, 2011.
[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations
of words and phrases and their compositionality. In Advances in Neural Information Processing Systems,
pages 3111–3119, 2013.
[6] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word
representations. In HLT-NAACL, pages 746–751, 2013.
[7] Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.
[8] Holger Schwenk. Continuous space language models. Computer Speech & Language, 21(3):492–518, 2007.
[9] Alisa Zhila, Wen-tau Yih, Christopher Meek, Geoffrey Zweig, and Tomas Mikolov. Combining
heterogeneous models for measuring relational similarity. In HLT-NAACL, pages 1000–1009, 2013.