Language variety identification is an author profiling subtask which aims to detect lexical and semantic variations in order to classify different varieties of the same language. In this work we approach the task by using distributed representations based on the work of Mikolov et al.
Language Variety Identification using Distributed Representations of Words and Documents
1. Language Variety Identification using
Distributed Representations of Words and Documents
Marc Franco-Salvador, Francisco Rangel, Paolo Rosso,
Mariona Taulé, and M. Antònia Martí
mfranco@prhlt.upv.es, francisco.rangel@autoritas.es, prosso@dsic.upv.es,
{mtaule,amarti}@ub.edu
2. Introduction
“Author profiling aims to identify the linguistic
profile of an author on the basis of his writing
style.”
“Language variety identification is an author
profiling subtask which aims to detect lexical
and semantic variations in order to classify
different varieties of the same language.”
3. Example
The same sentence in varieties of Spanish:
“Estaba haciendo el tonto con mi perro y perdí el
móvil” (ES-SP)
“Estaba haciendo boludeces con mi perro y extravié el
celular” (ES-AR)
“Estaba haciendo el pendejo con mi perro y extravié el
celular” (ES-MX)
Translation:
“I was goofing around with my dog and I lost my
mobile” (EN)
4. Related work
● Zampieri and Gebre (2012) investigated varieties of Portuguese applying different
features such as word and character n-grams.
● Sadat et al. (2014) differentiated between six different varieties of Arabic in blogs
and forums using character n-grams.
● Maier and Gómez-Rodríguez (2014) employed meta-learning to classify tweets from
Argentina, Chile, Colombia, Mexico and Spain.
● Kríž et al. (2015) employed cross-entropy to detect English texts written for non-
native English speakers.
------------------------------------------------------------------------------------------
● Fabra-Boluda et al. (2015) described the NLEL_UPV_Autoritas participation in the
Discrimination between Similar Languages (DSL) 2015 shared task.
● Franco-Salvador et al. (2015) applied distributed representations of words and
documents to classify different varieties of European languages.
5. Related work
Tasks on language variety identification:
– Workshop on Language Technology for Closely Related
Languages and Language Variants at EMNLP 2014.
– VarDial Workshop at COLING 2014 - Applying NLP Tools to
Similar Languages, Varieties and Dialects.
– LT4VarDial - Joint Workshop on Language Technology for
Closely Related Languages, Varieties and Dialects, with the DSL
shared task (Zampieri et al., 2014, 2015), at RANLP.
6. Proposed approach - motivation
The distributed representations of words capture
many linguistic regularities (Mikolov et al., 2013b):
vector('Paris') - vector('France') + vector('Italy')
is very close to
vector('Rome')
vector('king') - vector('man') + vector('woman')
is very close to
vector('queen')
Le and Mikolov (2014) employed distributed
representations of sentences to classify the polarity of
subjective text.
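The analogy arithmetic above can be tried on a toy example. This is a minimal sketch, assuming hand-made 3-dimensional vectors (the values in `TOY_VECTORS` and the helper `analogy` are hypothetical and exist only for illustration; real vectors are learned from a corpus and are much larger).

```python
import math

# Toy embeddings invented for illustration; they are NOT learned vectors.
TOY_VECTORS = {
    "king":   [0.9, 0.8, 0.1],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "queen":  [0.9, 0.1, 0.8],
    "prince": [0.8, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three query words are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman", TOY_VECTORS))
```

With these toy values the nearest remaining word to vector('king') - vector('man') + vector('woman') is 'queen'; with learned vectors the result depends on the training corpus.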
7. Distributed representation models
● Continuous bag-of-words (CBOW) model (Mikolov
et al., 2013b, 2013c).
– It is trained to predict a word in a text from its
surrounding context (bag-of-words
representation).
– It is fast and performs best on syntactic accuracy.
● Continuous skip-gram model (Mikolov et al.,
2013b, 2013c).
– It is trained to predict a word in a text from a
nearby word. Distant words have less impact on the
prediction.
– It performs considerably better on semantic accuracy.
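The difference between the two models shows up in the training examples each one builds from a text. The following is a minimal sketch, assuming a symmetric context window; the function names and the `window` parameter are hypothetical, and the down-weighting of distant words is not modelled here.

```python
def cbow_examples(tokens, window=2):
    """CBOW: the bag of surrounding words is the input, the middle word the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """Skip-gram: each (word, nearby word) pair is a separate training example."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["perdí", "el", "móvil"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

CBOW yields one example per position, while skip-gram yields one example per word pair inside the window, which is one reason CBOW is the faster of the two to train.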
9. Skip-gram model
The objective of the model is to maximize the
average of the log probability:
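In the notation of Mikolov et al. (2013c), and assuming a training sequence of words w_1, ..., w_T and a context window of size c (symbols not defined on this slide), this objective is commonly written as:

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)
```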
Conditional probability should be estimated
using the softmax function [Barto, 1998]:
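A common way to write this, following Mikolov et al. (2013c) and assuming v_w and v'_w denote the input and output vectors of word w, w_I the input word, w_O the output word and W the vocabulary size (symbols not defined on this slide), is:

```latex
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{W} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
```

The sum in the denominator runs over the whole vocabulary, which is what makes the full softmax expensive for large W.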
10. Alternatives to softmax function
Negative sampling (Mikolov et al. 2013b)
It simplifies Noise Contrastive Estimation (NCE)
(Gutmann and Hyvärinen, 2012) while keeping the vector
quality.
“the task is to distinguish the target word w_O from
a noise distribution P_n(w) using logistic
regression, where there are k negative samples
for each word.” (Mikolov et al. 2013b)
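The quoted idea can be made concrete for a single (input word, target word) pair. This is a minimal sketch, assuming the usual negative sampling objective (log-sigmoid of the target score plus log-sigmoid of the negated scores of k noise words); the function names and the toy vectors are hypothetical, and drawing the negatives from P_n(w) is left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_objective(v_input, v_target, v_negatives):
    """Objective for one training pair (to be maximized).

    v_input     -- input vector of the current word
    v_target    -- output vector of the true target word w_O
    v_negatives -- output vectors of the k words drawn from the noise distribution
    """
    positive_term = math.log(sigmoid(dot(v_target, v_input)))
    negative_term = sum(math.log(sigmoid(-dot(v_neg, v_input)))
                        for v_neg in v_negatives)
    return positive_term + negative_term


# Toy vectors invented for illustration (k = 2 negative samples).
v_input = [1.0, 0.0]
negatives = [[-2.0, 0.0], [-1.0, 0.0]]
good = negative_sampling_objective(v_input, [2.0, 0.0], negatives)
bad = negative_sampling_objective(v_input, [-2.0, 0.0], negatives)
print(good, bad)
```

Each pair now costs k + 1 dot products instead of a sum over the whole vocabulary, and the objective is higher when the target vector agrees with the input vector and the noise vectors do not.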
11. Generating distributed vectors of
sentences and documents
Two alternatives:
– Average the vectors of the words of a text (“Skip-
gram” in the evaluation)
e.g.: (vector('I') + vector('love') + vector('the') +
vector('capital') + vector('of') + vector('Bulgaria')) / 6
– Directly use the Sentence Vectors variant
(“SenVec” in the evaluation)
* We classified all the vectors using logistic
regression
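The first alternative, followed by the classification step, can be sketched end to end. This is a minimal sketch under stated assumptions: the 2-dimensional `WORD_VECTORS` are invented for illustration, the classifier is a small binary logistic regression trained by stochastic gradient descent (the actual setup, number of classes and hyperparameters are not given here), and all function names are hypothetical.

```python
import math

# Toy word vectors invented for illustration; real ones come from a trained model.
WORD_VECTORS = {
    "móvil": [1.0, 0.0],
    "celular": [0.0, 1.0],
    "tonto": [0.9, 0.1],
    "boludeces": [0.1, 0.9],
    "perro": [0.5, 0.5],
}


def average_vector(tokens, word_vectors):
    """Represent a text as the mean of the vectors of its known words."""
    dim = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Binary logistic regression trained with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical label encoding: 0 = ES-SP, 1 = ES-AR.
train_docs = [["tonto", "perro", "móvil"], ["boludeces", "perro", "celular"]]
train_labels = [0, 1]
X = [average_vector(doc, WORD_VECTORS) for doc in train_docs]
weights, bias = train_logistic_regression(X, train_labels)
print(predict(weights, bias, average_vector(["perro", "celular"], WORD_VECTORS)))
```

For the second alternative (SenVec), only `average_vector` would change: the document vector would come directly from the model instead of being averaged from word vectors.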
13. Proposed alternatives
Author profiling models:
– Emotion-labeled Graphs (Rangel and Rosso, 2015)
(EmoGraphs)
– Information Gain Word-Patterns (Martí et al., 2015)
(IG-WP)
14. EmoGraph of “He estado tomando cursos en línea sobre
temas valiosos que disfruto estudiando y que podrían
ayudarme a hablar en público” (“I have been taking online
courses about valuable subjects that I enjoy studying and might
help me to speak in public”)
15. Information Gain Word-Patterns
Information Gain Word-Patterns (IG-WP) (Martí
et al., 2015) obtains lexico-syntactic patterns
aiming to represent the content of documents.
The method is based on the pattern-
construction hypothesis:
– “those contexts that are relevant to the
definition of a cluster of semantically related
words tend to be (part of) lexico-syntactic
constructions”.
16. Information Gain Word-Patterns
Pattern structure:
Examples:
In the experiments we selected as features the set
of 1,000 words that obtained the patterns with the
highest information gain.
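Ranking by information gain can be illustrated with a small sketch. It assumes the standard definition (class entropy minus the class entropy conditioned on whether a word occurs in a document) and ranks plain words; how IG-WP scores its lexico-syntactic patterns is not specified here, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(present, labels):
    """IG of a binary feature: H(class) - H(class | feature present or absent)."""
    total = len(labels)
    with_feature = [y for p, y in zip(present, labels) if p]
    without_feature = [y for p, y in zip(present, labels) if not p]
    conditional = sum(len(part) / total * entropy(part)
                      for part in (with_feature, without_feature) if part)
    return entropy(labels) - conditional


def top_k_words(documents, labels, k):
    """Return the k words whose presence is most informative about the class."""
    vocabulary = sorted(set().union(*documents))
    gains = {w: information_gain([w in doc for doc in documents], labels)
             for w in vocabulary}
    return sorted(vocabulary, key=lambda w: gains[w], reverse=True)[:k]


# Toy documents (as sets of words) invented for illustration.
docs = [{"celular", "perro"}, {"celular", "boludeces"},
        {"móvil", "perro"}, {"móvil", "tonto"}]
labels = ["AR", "AR", "ES", "ES"]
print(top_k_words(docs, labels, k=2))
```

In this toy data 'celular' and 'móvil' separate the two classes perfectly and are ranked first, while 'perro' occurs equally in both classes and carries no information.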
17. Dataset
We introduce the HispaBlogs [1] dataset, a new
collection of Spanish blogs from five different
countries: Argentina, Chile, Mexico, Peru and
Spain.
For each language variety there are 450 training
blogs and 200 test blogs.
Each user blog is represented by a set of user
posts, with 10 posts per user/blog.
[1] https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
18. Evaluation
We measured classification accuracy, comparing
our approaches with several models and
baselines.
Author profiling models:
– EmoGraphs
– IG-WP
Baselines:
– Bag-of-words
– Character 4-grams
– TF-IDF 2-grams
– TF-IDF graphs
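As an illustration of the evaluation setup, the sketch below shows how character 4-gram features could be extracted and how accuracy is computed from predictions. It is a minimal sketch with hypothetical function names; the exact feature weighting and classifiers used for the baselines are not specified here.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Count the overlapping character n-grams of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def accuracy(predicted, gold):
    """Fraction of documents whose predicted variety matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


print(char_ngrams("celular"))
print(accuracy(["AR", "ES", "MX"], ["AR", "ES", "ES"]))
```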
21. Conclusions
● Distributed representations obtain
competitive results in the task of
language variety identification in social media.
● Averaging the vectors of words (Skip-gram)
and using vectors of documents (SenVec)
gave similar results, with no significant
differences.
22. Future work
● We will investigate how to apply distributed
representations to other author profiling tasks
such as age and gender identification.
● We will continue working to improve the current
model in order to generate better distributed
representations for discriminating between
similar languages.
23. Thank you for your time :)
Questions / feedback?
francisco.rangel@autoritas.es
This work has been published as:
Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Martí, M. A. (2015).
Language variety identification using distributed representations of words and
documents. In Proceedings of the 6th International Conference of CLEF on
Experimental IR meets Multilinguality, Multimodality, and Interaction (CLEF 2015),
LNCS, vol. 9283. Springer-Verlag.
24. References
Barto, A. G. (1998). Reinforcement learning: An introduction. MIT press.
Fabra-Boluda, R., Rangel, F., Rosso, P. (2015). NLEL_UPV_Autoritas participation at Discrimination
between Similar Languages (DSL) 2015 shared task. In: Proc. of the Joint Workshop on Language
Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.
Franco-Salvador, M., Rosso, P., & Rangel, F. (2015). Distributed Representations of Words and
Documents for Discriminating Similar Languages. In: Proc. of the Joint Workshop on Language
Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.
Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical
models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1),
307-361.
Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint
arXiv:1405.4053.
Maier, W., & Gómez-Rodríguez, C. (2014). Language variety identification in Spanish tweets.
LT4CloseLang 2014, 25.
Martí, M.A., Bertran, M., Taulé, M., Salamó, M. (2015). Distributional approach based on syntactic
dependencies for discovering constructions. In Computational Linguistics (under review)
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in
vector space. In Proceedings of Workshop at ICLR.
25. References
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013c). Distributed representations of
words and phrases and their compositionality. In Advances in Neural Information Processing Systems
(pp. 3111-3119).
Morin, F., & Bengio, Y. (2005, January). Hierarchical probabilistic neural network language model. In
Proceedings of the international workshop on artificial intelligence and statistics (pp. 246-252).
Rangel, F., & Rosso, P. (2015). On the impact of emotions on author profiling. Information Processing &
Management.
Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and
Dialects in Social Media. SocialNLP 2014, 22.
Zampieri, M., & Gebre, B. G. (2012). Automatic identification of language varieties: The case of
Portuguese. In KONVENS2012-The 11th Conference on Natural Language Processing (pp. 233-237).
Österreichischen Gesellschaft für Artificial Intelligence (ÖGAI).
Zampieri, M., Tan, L., Ljubešić, N., & Tiedemann, J. (2014). A report on the DSL shared task 2014.
COLING 2014, 58.
Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J., & Nakov, P. (2015). Overview of the DSL shared task
2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages,
Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.