Language Variety Identification using
Distributed Representations of Words and Documents
Marc Franco-Salvador, Francisco Rangel, Paolo Rosso,
Mariona Taulé, and M. Antònia Martí
mfranco@prhlt.upv.es, francisco.rangel@autoritas.es, prosso@dsic.upv.es,
{mtaule,amarti}@ub.edu
Introduction
“Author profiling aims to identify the linguistic
profile of an author on the basis of his writing
style.”
“Language variety identification is an author
profiling subtask which aims to detect lexical
and semantic variations in order to classify
different varieties of the same language.”
Example
The same sentence in varieties of Spanish:
“Estaba haciendo el tonto con mi perro y perdí el
móvil” (ES-SP)
“Estaba haciendo boludeces con mi perro y extravié el
celular” (ES-AR)
“Estaba haciendo el pendejo con mi perro y extravié el
celular” (ES-MX)
Translation:
“I was goofing around with my dog and I lost my
mobile” (EN)
Related work
● Zampieri and Gebre (2012) investigated varieties of Portuguese applying different
features such as word and character n-grams.
● Sadat et al. (2014) differentiated between six different varieties of Arabic in blogs
and forums using character n-grams.
● Maier and Gómez-Rodríguez (2014) employed meta-learning to classify tweets from
Argentina, Chile, Colombia, Mexico and Spain.
● Kríž et al. (2015) employed cross-entropy to detect English texts written for non-
native English speakers.
------------------------------------------------------------------------------------------
● Fabra-Boluda et al. (2015) describe the NLEL_UPV_Autoritas participation in the
Discrimination between Similar Languages (DSL) 2015 shared task.
● Franco-Salvador et al. (2015) applied distributed representations of words and
documents to classify different varieties of European languages.
Tasks on language variety identification:
– Workshop on Language Technology for Closely Related
Languages and Language Variants at EMNLP2014.
– VarDial Workshop at COLING 2014 - Applying NLP Tools to
Similar Languages, Varieties and Dialects.
– LT4VarDial - Joint Workshop on Language Technology for
Closely Related Languages, Varieties and Dialects, and the DSL
shared task (Zampieri et al., 2014, 2015), at RANLP.
Proposed approach - motivation
The distributed representations of words capture
many linguistic regularities (Mikolov et al., 2013b):
vector('Paris') - vector('France') + vector('Italy')
is very close to
vector('Rome')
vector('king') - vector('man') + vector('woman')
is very close to
vector('queen')
Le and Mikolov (2014) employed distributed
representations of sentences to classify the polarity of
subjective text.
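The vector arithmetic in the analogies above can be illustrated with a minimal sketch. The 3-dimensional vectors below are invented toy values (real embeddings are learned from text), and the helper names `cosine` and `analogy` are hypothetical, not part of any library.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only; real word
# embeddings are learned from text and usually have hundreds of dimensions.
TOY_VECTORS = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "dog":   [0.0,  0.2, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman", TOY_VECTORS))  # queen
```

With learned embeddings the match is only approximate, which is why the result is described as "very close to" the target vector rather than equal to it.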
Distributed representation models
● Continuous bag-of-words (CBOW) model (Mikolov
et al., 2013b, 2013c).
– It is trained to predict a word in a text from its
surrounding context (bag-of-words
representation).
– It is fast and achieves high syntactic accuracy.
● Continuous skip-gram model (Mikolov et al.,
2013b, 2013c).
– It is trained to predict a word in a text from a
nearby word. Distant words have less impact on the
prediction.
– It considerably improves semantic accuracy.
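The difference between the two models can be seen in the training examples each one draws from a sentence. The sketch below is illustrative only: the function names and the fixed `window` parameter are assumptions, and it does not model the down-weighting of distant words mentioned above.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (input word, nearby word to predict) pair per neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (bag of context words, word to predict) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        examples.append((context, target))
    return examples

sentence = ["perdí", "el", "móvil"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

CBOW produces one example per word position, while skip-gram produces one per (word, neighbour) pair, so skip-gram sees more training examples for the same text.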
Skip-gram model
The objective of the model is to maximize the
average of the log probability:
The conditional probability is estimated
using the softmax function (Barto, 1998):
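For reference, the skip-gram objective and the softmax can be written as in Mikolov et al. (2013c). The notation is taken from that paper and is an assumption here: T is the number of training words, c the context window size, v_w and v'_w the input and output vectors of word w, and W the vocabulary size.

```latex
% Average log probability maximized by the skip-gram model
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% Softmax estimate of the conditional probability
p(w_O \mid w_I) =
  \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}
       {\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}
```

The denominator sums over the whole vocabulary, which is what makes the full softmax expensive for large W.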
Alternatives to softmax function
Negative sampling (Mikolov et al. 2013b)
It simplifies the Noise Contrastive Estimation (NCE)
(Gutmann and Hyvärinen, 2012) keeping the vector
quality.
“the task is to distinguish the target word from
a noise distribution using logistic
regression, where there are k negative samples
for each word.” (Mikolov et al. 2013b)
(w_O: target word; P_n(w): noise distribution)
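As a sketch of what the quotation describes, the negative sampling objective in Mikolov et al. (2013c) replaces each log p(w_O | w_I) term of the skip-gram objective with the expression below. The notation follows that paper and is an assumption here: σ is the logistic function, v_w and v'_w are the input and output vectors of word w, k is the number of negative samples, and P_n(w) is the noise distribution.

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

The first term rewards a high score for the observed target word, and the sum penalizes high scores for the k words drawn from the noise distribution.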
Generating distributed vectors of
sentences and documents
Two alternatives:
– Average the vectors of the words of a text (“Skip-
gram” in the evaluation)
e.g.: (vector('I') + vector('love') + vector('the') +
vector('capital') + vector('of') + vector('Bulgaria')) / 6
– Use the Sentence Vectors variation directly
(“SenVec” in the evaluation)
* We classified all the vectors using logistic
regression
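The averaging alternative can be sketched as follows. The embedding values are invented toy numbers, `average_vector` is a hypothetical helper, and skipping words without an embedding is an assumption of this sketch rather than something stated in the slides.

```python
def average_vector(tokens, embeddings):
    """Represent a text as the element-wise mean of its word vectors."""
    # Assumption: tokens without an embedding are skipped.
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

# Toy 2-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "I": [1.0, 0.0],
    "love": [0.0, 1.0],
    "Bulgaria": [2.0, 2.0],
}

doc_vector = average_vector(["I", "love", "Bulgaria"], TOY_EMBEDDINGS)
print(doc_vector)  # [1.0, 1.0]
```

The resulting fixed-length vector is what would then be passed, together with the variety label of the text, to the logistic regression classifier.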
Proposed alternatives
Author profiling models:
– Emotion-labeled Graphs (Rangel and Rosso, 2015)
(EmoGraphs)
– Information Gain Word-Patterns (Martí et al., 2015)
(IG-WP)
EmoGraph of “He estado tomando cursos en línea sobre
temas valiosos que disfruto estudiando y que podrían
ayudarme a hablar en público” (“I have been taking online
courses about valuable subjects that I enjoy studying and might
help me to speak in public”)
Information Gain Word-Patterns
Information Gain Word-Patterns (IG-WP) (Martí
et al., 2015) obtains lexico-syntactic patterns
aiming to represent the content of documents.
The method is based on the pattern-
construction hypothesis:
– “those contexts that are relevant to the
definition of a cluster of semantically related
words tend to be (part of) lexico-syntactic
constructions”.
Pattern structure:
Examples:
In the experiments we selected as features the set
of 1,000 words that obtained the patterns with the
highest information gain.
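Ranking features by information gain can be sketched as below. This is a generic feature-selection illustration on word presence, not the IG-WP pattern construction itself; the toy documents, labels and function names are all invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(docs, labels, word):
    """Entropy reduction on the labels obtained by knowing whether `word` occurs."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    without_word = [lab for doc, lab in zip(docs, labels) if word not in doc]
    n = len(labels)
    conditional = (len(with_word) / n) * entropy(with_word) \
        + (len(without_word) / n) * entropy(without_word)
    return entropy(labels) - conditional

def top_k_words(docs, labels, k):
    """The k words with the highest information gain (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    return ranked[:k]

# Toy documents (as sets of words) and labels, invented for illustration.
TOY_DOCS = [
    {"tonto", "perro"},
    {"tonto", "perro", "mi"},
    {"boludeces", "perro", "celular"},
    {"boludeces", "perro"},
]
TOY_LABELS = ["ES-SP", "ES-SP", "ES-AR", "ES-AR"]

print(top_k_words(TOY_DOCS, TOY_LABELS, 2))  # ['boludeces', 'tonto']
```

A word that appears in every document, such as "perro" here, has zero information gain, while a word that occurs in only one variety receives the maximum.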
Dataset
We introduce the HispaBlogs dataset [1], a new
collection of Spanish blogs from five different
countries: Argentina, Chile, Mexico, Peru and
Spain.
There are 450 training and 200 test blogs
for each language variety.
Each user blog is represented by a set of user
posts, with 10 posts per user/blog.
[1] https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
Evaluation
We measured classification accuracy, comparing
our approaches with several models and
baselines.
Author profiling models:
– EmoGraphs
– IG-WP
Baselines:
– Bag-of-words
– Character 4-grams
– TF-IDF 2-grams
– TF-IDF graphs
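As an illustration of the character 4-gram baseline, the sketch below extracts overlapping character n-grams and counts them. The function names are hypothetical, and the slides do not say how the counts were weighted or classified, so this covers feature extraction only.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """All overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(text, n=4):
    """Bag of character n-grams: each n-gram mapped to its frequency."""
    return Counter(char_ngrams(text, n))

print(char_ngrams("celular"))  # ['celu', 'elul', 'lula', 'ular']
```

Texts shorter than n characters yield no n-grams, so a real pipeline would need to decide how to handle very short posts.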
Experimental results
Test set confusion matrix (in %) of
Skip-gram model
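A confusion matrix expressed in percentages can be computed as in the sketch below. The labels and predictions are invented toy data, not results from the evaluation, and putting the true class on the rows is an assumption of this example.

```python
def confusion_matrix_percent(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix: matrix[true][predicted] in percent."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    matrix = {}
    for t in classes:
        row_total = sum(counts[t].values())
        matrix[t] = {
            p: (100.0 * counts[t][p] / row_total if row_total else 0.0)
            for p in classes
        }
    return matrix

# Toy data, invented for illustration only.
CLASSES = ["ES-AR", "ES-MX"]
TOY_TRUE = ["ES-AR", "ES-AR", "ES-MX", "ES-MX"]
TOY_PRED = ["ES-AR", "ES-MX", "ES-MX", "ES-MX"]

toy_matrix = confusion_matrix_percent(TOY_TRUE, TOY_PRED, CLASSES)
print(toy_matrix["ES-AR"])  # {'ES-AR': 50.0, 'ES-MX': 50.0}
```

Each row sums to 100, so the diagonal entry of a row is the percentage of texts of that variety that were classified correctly.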
Conclusions
● The use of distributed representations makes it
possible to obtain competitive results in the task of
language variety identification in social media.
● Averaging word vectors (Skip-gram) and using
document vectors (SenVec) gave similar results,
with no significant differences.
Future work
● We will investigate how to apply distributed
representations to other author profiling tasks
such as age and gender identification.
● We will continue working to improve the current
model in order to generate better distributed
representations for discriminating between
similar languages.
Thank you for your time :)
Questions / feedback?
francisco.rangel@autoritas.es
This work has been published as:
Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Martí, M. A. (2015).
Language variety identification using distributed representations of words and
documents. In Proceedings of the 6th International Conference of CLEF on
Experimental IR meets Multilinguality, Multimodality, and Interaction (CLEF 2015),
LNCS, volume 9283. Springer-Verlag.
References
Barto, A. G. (1998). Reinforcement learning: An introduction. MIT press.
Fabra-Boluda, R., Rangel, F., Rosso, P. (2015). NLEL_UPV_Autoritas participation at Discrimination
between Similar Languages (DSL) 2015 shared task. In: Proc. of the Joint Workshop on Language
Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.
Franco-Salvador, M., Rosso, P., & Rangel, F. (2015). Distributed Representations of Words and
Documents for Discriminating Similar Languages. In: Proc. of the Joint Workshop on Language
Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.
Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical
models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1),
307-361.
Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint
arXiv:1405.4053.
Maier, W., & Gómez-Rodríguez, C. (2014). Language variety identification in Spanish tweets.
LT4CloseLang 2014, 25.
Martí, M.A., Bertran, M., Taulé, M., Salamó, M. (2015). Distributional approach based on syntactic
dependencies for discovering constructions. In Computational Linguistics (under review)
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in
vector space. In Proceedings of Workshop at ICLR.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013c). Distributed representations of
words and phrases and their compositionality. In Advances in Neural Information Processing Systems
(pp. 3111-3119).
Morin, F., & Bengio, Y. (2005, January). Hierarchical probabilistic neural network language model. In
Proceedings of the international workshop on artificial intelligence and statistics (pp. 246-252).
Rangel, F., & Rosso, P. (2015). On the impact of emotions on author profiling. Information Processing &
Management.
Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and
Dialects in Social Media. SocialNLP 2014, 22.
Zampieri, M., & Gebre, B. G. (2012). Automatic identification of language varieties: The case of
Portuguese. In KONVENS2012-The 11th Conference on Natural Language Processing (pp. 233-237).
Österreichischen Gesellschaft für Artificial Intelligence (ÖGAI).
Zampieri, M., Tan, L., Ljubešić, N., & Tiedemann, J. (2014). A report on the DSL shared task 2014.
COLING 2014, 58.
Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J., & Nakov, P. (2015). Overview of the DSL shared task
2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages,
Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.

PR-SOCO Personality Recognition in SOurce COde (PAN@FIRE 2016)PR-SOCO Personality Recognition in SOurce COde (PAN@FIRE 2016)
PR-SOCO Personality Recognition in SOurce COde (PAN@FIRE 2016)Francisco Manuel Rangel Pardo
 
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...
Overview of PAN'16 - New challenges for Authorship Analysis: Cross-genre prof...Francisco Manuel Rangel Pardo
 
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...
El Futuro de las Comunicaciones Personales a Través de los Dispositivos Móvil...Francisco Manuel Rangel Pardo
 

Language Variety Identification using Distributed Representations of Words and Documents

  • 1. Language Variety Identification using Distributed Representations of Words and Documents Marc Franco-Salvador, Francisco Rangel, Paolo Rosso, Mariona Taulé, and M. Antònia Martí mfranco@prhlt.upv.es, francisco.rangel@autoritas.es, prosso@dsic.upv.es, {mtaule,amarti}@ub.edu
  • 2. Introduction “Author profiling aims to identify the linguistic profile of an author on the basis of his writing style.” “Language variety identification is an author profiling subtask which aims to detect lexical and semantic variations in order to classify different varieties of the same language.”
  • 3. Example The same sentence in varieties of Spanish: “Estaba haciendo el tonto con mi perro y perdí el móvil” (ES-SP) “Estaba haciendo boludeces con mi perro y extravié el celular” (ES-AR) “Estaba haciendo el pendejo con mi perro y extravié el celular” (ES-MX) Translation: “I was goofing around with my dog and I lost my mobile” (EN)
  • 4. Related work ● Zampieri and Gebre (2012) investigated varieties of Portuguese applying different features such as word and character n-grams. ● Sadat et al. (2014) differentiated between six varieties of Arabic in blogs and forums using character n-grams. ● Maier and Gómez-Rodríguez (2014) employed meta-learning to classify tweets from Argentina, Chile, Colombia, Mexico and Spain. ● Kríž et al. (2015) employed cross-entropy to detect English texts written for non-native English speakers. ● Fabra-Boluda et al. (2015) describe the NLEL_UPV_Autoritas participation at the Discrimination between Similar Languages (DSL) 2015 shared task. ● Franco-Salvador et al. (2015) applied distributed representations of words and documents to classify different varieties of European languages.
  • 5. Related work Tasks on language variety identification: – Workshop on Language Technology for Closely Related Languages and Language Variants at EMNLP 2014. – VarDial Workshop at COLING 2014 - Applying NLP Tools to Similar Languages, Varieties and Dialects. – LT4VarDial - Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (DSL shared task; Zampieri et al., 2014, 2015) at RANLP.
  • 6. Proposed approach - motivation The distributed representations of words capture many linguistic regularities (Mikolov et al., 2013b): vector('Paris') - vector('France') + vector('Italy') is very close to vector('Rome') vector('king') - vector('man') + vector('woman') is very close to vector('queen') Le and Mikolov (2014) employed distributed representations of sentences to classify the polarity of subjective text.
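The analogy arithmetic on this slide can be sketched in a few lines. The vectors below are hypothetical three-dimensional toy values chosen by hand so that the regularity holds; real word2vec vectors are learned from text and have hundreds of dimensions. The helper names `cosine` and `analogy` are illustrative and not taken from the slides.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
vectors = {
    "Paris":   [1.0, 0.0, 1.0],
    "France":  [1.0, 0.0, 0.0],
    "Rome":    [0.0, 1.0, 1.0],
    "Italy":   [0.0, 1.0, 0.0],
    "Berlin":  [0.5, 0.5, 1.0],
    "Germany": [0.5, 0.5, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates, as is usual for this test.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("Paris", "France", "Italy", vectors))
```

Cosine similarity is used here as the notion of "very close"; that choice is an assumption of the sketch.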
  • 7. Distributed representation models ● Continuous bag-of-words (CBOW) model (Mikolov et al., 2013b, 2013c). – It is trained to predict a word in a text from its surrounding context (bag-of-words representation). – It is fast and favors syntactic accuracy. ● Continuous skip-gram model (Mikolov et al., 2013b, 2013c). – It is trained to predict a word in a text from a nearby word. Distant words have less impact on the prediction. – It considerably improves semantic accuracy.
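The difference between the two models is easiest to see in the training examples each one consumes. The sketch below only builds those examples from a tokenized sentence; it does not train anything, and the function names and the fixed symmetric `window` are assumptions made for illustration. In particular, it does not reproduce the down-weighting of distant words mentioned above.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word): CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

def skipgram_examples(tokens, window=2):
    """Yield (input_word, nearby_word): skip-gram predicts each nearby word from the input."""
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield word, tokens[j]

tokens = "perdí el móvil".split()
for context, target in cbow_examples(tokens, window=1):
    print("CBOW     ", context, "->", target)
for word, nearby in skipgram_examples(tokens, window=1):
    print("skip-gram", word, "->", nearby)
```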
  • 9. Skip-gram model The objective of the model is to maximize the average of the log probability: Conditional probability should be estimated using the softmax function [Barto, 1998]: Reminder:
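For reference, the skip-gram objective and the softmax are written as follows in Mikolov et al. (2013c). This assumes the slide follows that paper's notation: T is the number of training words, c the size of the context window, v and v' the input and output vectors of a word, and W the vocabulary size.

```latex
% Average log probability to be maximized
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% Softmax estimate of the conditional probability
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{W} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
```

The denominator sums over the whole vocabulary, which is what makes the full softmax expensive and motivates the alternatives on the next slide.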
  • 10. Alternatives to softmax function Negative sampling (Mikolov et al. 2013b) It simplifies Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2012) while keeping the vector quality. “the task is to distinguish the target word from a noise distribution using logistic regression, where there are k negative samples for each word.” (Mikolov et al. 2013b) Here w_O is the target word and P_n(w) is the noise distribution.
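A minimal sketch of the negative sampling objective for a single (input word, target word) pair, following the form given in Mikolov et al. (2013c): the log-sigmoid of the target score plus the log-sigmoid of the negated score of each of the k noise words. The function names are illustrative, and the noise vectors are passed in directly rather than drawn from P_n(w), which is a simplification.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_objective(v_input, v_target, v_noise):
    """Objective (to be maximized) for one (input word, target word) pair.

    v_input:  input vector of the input word
    v_target: output vector of the target word
    v_noise:  output vectors of the k words sampled from the noise distribution
    """
    positive = log_sigmoid(dot(v_target, v_input))
    negative = sum(log_sigmoid(-dot(v, v_input)) for v in v_noise)
    return positive + negative

# A target aligned with the input and a noise word pointing away score well...
good = negative_sampling_objective([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
# ...while the reverse configuration is penalized.
bad = negative_sampling_objective([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
print(good > bad)
```

Only k + 1 dot products are needed per pair, instead of one per vocabulary word as in the full softmax.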
  • 12. Generating distributed vectors of sentences and documents Two alternatives: – Average the vectors of the words of a text (“Skip-gram” in the evaluation) e.g.: (vector('I') + vector('love') + vector('the') + vector('capital') + vector('of') + vector('Bulgaria')) / 6 – Use directly the Sentence Vectors variation (“SenVec” in the evaluation) * We classified all the vectors using logistic regression
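The first alternative, followed by the logistic regression step, can be sketched as below. Everything here is a toy under stated assumptions: the two-dimensional word vectors are hypothetical, the classifier is a plain binary logistic regression fitted by batch gradient descent (the actual task has five varieties, and the slides do not say which implementation was used), and the function names are illustrative.

```python
import math

def average_vector(tokens, word_vectors):
    """Average the vectors of the known words of a text."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Binary logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for d, xi in enumerate(x):
                grad_w[d] += (p - label) * xi
            grad_b += p - label
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(x, w, b):
    """Return 1 if the logit is positive, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Hypothetical toy vectors; label 0 stands for ES-SP and label 1 for ES-AR.
word_vectors = {
    "móvil": [1.0, 0.0], "tonto": [0.9, 0.1],
    "celular": [0.0, 1.0], "boludeces": [0.1, 0.9],
    "perro": [0.5, 0.5],
}
docs = [["tonto", "perro", "móvil"], ["boludeces", "perro", "celular"]]
X = [average_vector(d, word_vectors) for d in docs]
w, b = train_logistic_regression(X, [0, 1])
print([predict(x, w, b) for x in X])
```

Words without a vector are skipped when averaging, which is one possible way to handle out-of-vocabulary tokens; the slides do not specify how they were handled.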
  • 13. Proposed alternatives Author profiling models: – Emotion-labeled Graphs (Rangel and Rosso, 2015) (EmoGraphs) – Information Gain Word-Patterns (Martí et al., 2015) (IG-WP)
  • 14. EmoGraph of “He estado tomando cursos en línea sobre temas valiosos que disfruto estudiando y que podrían ayudarme a hablar en público” (“I have been taking online courses about valuable subjects that I enjoy studying and might help me to speak in public”)
  • 15. Information Gain Word-Patterns Information Gain Word-Patterns (IG-WP) (Martí et al., 2015) obtains lexico-syntactic patterns aiming to represent the content of documents. The method is based on the pattern-construction hypothesis: – “those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions”.
  • 16. Information Gain Word-Patterns Pattern structure: Examples: In the experiments we selected as features the 1,000 words that yielded the patterns with the highest information gain.
  • 17. Dataset We introduce the HispaBlogs dataset (https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs), a new collection of Spanish blogs from five different countries: Argentina, Chile, Mexico, Peru and Spain. For each language variety there are 450 training blogs and 200 testing blogs. Each user blog is represented by a set of 10 posts.
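As a reminder of the selection criterion, the sketch below computes the textbook information gain of a single word feature: the entropy of the class labels minus their entropy once documents are split by whether they contain the word. It is not the IG-WP method itself, which works on lexico-syntactic patterns; the documents and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, labels, word):
    """H(class) - H(class | word present or absent), with docs given as token lists."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    conditional = sum(len(part) / len(labels) * entropy(part)
                      for part in (with_word, without) if part)
    return entropy(labels) - conditional

# Hypothetical toy corpus: two documents per variety.
docs = [["el", "móvil"], ["mi", "móvil"], ["el", "celular"], ["mi", "celular"]]
labels = ["ES-SP", "ES-SP", "ES-AR", "ES-AR"]
for word in ("móvil", "el"):
    print(word, information_gain(docs, labels, word))
```

A word that separates the varieties perfectly gets the maximum gain, while a word spread evenly across them gets none, so ranking by this score keeps the most discriminative features.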
  • 18. Evaluation We measured the accuracy of classification comparing our approaches with several models and baselines. Author profiling models: – EmoGraphs – IG-WP Baselines: – Bag-of-words – Character 4-grams – TF-IDF 2-grams – TF-IDF graphs
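Two of the ingredients named on this slide are simple enough to sketch: the character 4-gram features behind one of the baselines, and the accuracy measure. The function names are illustrative, and details such as whether spaces or case were normalized in the experiments are not stated in the slides, so the text is used as-is here.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Count overlapping character n-grams, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

print(char_ngrams("el móvil"))
# Hypothetical labels, for illustration only.
print(accuracy(["ES-SP", "ES-AR", "ES-MX"], ["ES-SP", "ES-AR", "ES-AR"]))
```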
  • 20. Test set confusion matrix (in %) of Skip-gram model
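A confusion matrix in percentages can be computed as sketched below, assuming each row (true variety) is normalized to sum to 100. The labels are hypothetical and are not the results shown on the slide.

```python
def confusion_matrix_percent(gold, predicted, classes):
    """Row-normalized confusion matrix: cell [g][p] is the % of class g predicted as p."""
    counts = {g: {p: 0 for p in classes} for g in classes}
    for g, p in zip(gold, predicted):
        counts[g][p] += 1
    matrix = {}
    for g in classes:
        total = sum(counts[g].values())
        matrix[g] = {p: (100.0 * counts[g][p] / total if total else 0.0) for p in classes}
    return matrix

# Hypothetical gold and predicted labels, for illustration only.
classes = ["ES-AR", "ES-SP"]
gold = ["ES-AR", "ES-AR", "ES-AR", "ES-AR", "ES-SP", "ES-SP"]
predicted = ["ES-AR", "ES-AR", "ES-AR", "ES-SP", "ES-SP", "ES-SP"]
matrix = confusion_matrix_percent(gold, predicted, classes)
for g in classes:
    print(g, matrix[g])
```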
  • 21. Conclusions ● Distributed representations make it possible to obtain competitive results in the task of language variety identification in social media. ● Averaging the vectors of words (Skip-gram) and using vectors of documents (SenVec) gave similar results, with no significant differences.
  • 22. Future work ● We will investigate how to apply distributed representations to other author profiling tasks such as age and gender identification. ● We will continue working to improve the current model in order to generate better distributed representations for discriminating between similar languages.
  • 23. Thank you for your time :) Questions / feedback? francisco.rangel@autoritas.es This work has been published as: Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Martí, M. A. (2015). Language variety identification using distributed representations of words and documents. In Proceedings of the 6th International Conference of CLEF on Experimental IR meets Multilinguality, Multimodality, and Interaction (CLEF 2015), volume LNCS(9283). Springer-Verlag.
  • 24. References Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press. Fabra-Boluda, R., Rangel, F., & Rosso, P. (2015). NLEL_UPV_Autoritas participation at Discrimination between Similar Languages (DSL) 2015 shared task. In: Proc. of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria. Franco-Salvador, M., Rosso, P., & Rangel, F. (2015). Distributed Representations of Words and Documents for Discriminating Similar Languages. In: Proc. of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria. Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1), 307-361. Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053. Maier, W., & Gómez-Rodríguez, C. (2014). Language variety identification in Spanish tweets. LT4CloseLang 2014, 25. Martí, M. A., Bertran, M., Taulé, M., & Salamó, M. (2015). Distributional approach based on syntactic dependencies for discovering constructions. In Computational Linguistics (under review). Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.
  • 25. References Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119). Morin, F., & Bengio, Y. (2005, January). Hierarchical probabilistic neural network language model. In Proceedings of the international workshop on artificial intelligence and statistics (pp. 246-252). Rangel, F., & Rosso, P. (2015). On the impact of emotions on author profiling. Information Processing & Management. Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and Dialects in Social Media. SocialNLP 2014, 22. Zampieri, M., & Gebre, B. G. (2012). Automatic identification of language varieties: The case of Portuguese. In KONVENS2012-The 11th Conference on Natural Language Processing (pp. 233-237). Österreichischen Gesellschaft für Artificial Intelligence (ÖGAI). Zampieri, M., Tan, L., Ljubešić, N., & Tiedemann, J. (2014). A report on the DSL shared task 2014. COLING 2014, 58. Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J., & Nakov, P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.