Gender and Language Variety Identification in Twitter. Overview of the 5th Author Profiling task at PAN@CLEF 2017.

5th Author Profiling task at PAN
Gender and Language Variety
Identification in Twitter
PAN-AP-2017 CLEF 2017
Dublin, 11-14 September
Francisco Rangel
Autoritas Consulting &
PRHLT Research Center -
Universitat Politècnica de València
Paolo Rosso
PRHLT Research Center
Universitat Politècnica de València
Martin Potthast & Benno Stein
Bauhaus-Universität Weimar

Introduction
Author profiling aims at identifying
an author’s traits, such as age, gender,
personality, native language,
language variety…, from their writings.
This is crucial for:
- Marketing
- Security
- Forensics

Task goal
To investigate the joint identification of
an author’s gender and language variety.
Four languages:
English, Spanish, Portuguese, Arabic

Corpus collection
● Step 1: Languages and varieties selection.
● Step 2: Tweets per region retrieval.

Corpus collection
● Step 3: Unique authors identification.
● Step 4: Authors selection:
○ Tweets are not retweets.
○ Tweets are written in the corresponding language.
● Step 5: Language variety annotation:
○ 80% of tweet meta-data coincide with:
■ Geotagging.
■ Toponyms of the region.
● Step 6: Gender annotation:
○ Automatically: dictionary of proper nouns.
○ Manually: visual review.
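The selection and annotation steps above can be sketched roughly as follows. This is only an illustration: the `Tweet` fields, the helper names and the reading of the 80% rule as "at least 80% of an author's tweets match the region" are assumptions, not details given on the slides.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    author: str
    text: str
    lang: str         # hypothetical language code attached to the tweet
    is_retweet: bool  # hypothetical retweet flag


def eligible_tweets(tweets, language):
    """Step 4 (sketch): keep tweets that are not retweets and are in `language`."""
    return [t for t in tweets if not t.is_retweet and t.lang == language]


def accept_variety(region_matches, threshold=0.8):
    """Step 5 (sketch): accept the variety label when enough tweets match.

    `region_matches` holds one boolean per tweet, True when its metadata
    (geotag or toponym) agrees with the candidate region. Treating 80% as a
    minimum threshold is an assumption.
    """
    if not region_matches:
        return False
    return sum(region_matches) / len(region_matches) >= threshold
```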

Corpus
● Step 7: Corpus construction:
○ 500 authors per variety and gender.
■ 300 for training, 200 for test.
○ 100 tweets per author.
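A minimal sketch of the 300/200 split for one variety and gender group is given below. The slides do not say how authors were assigned to each partition, so the seeded random shuffle and the function name `split_authors` are assumptions made for illustration.

```python
import random


def split_authors(authors, n_train=300, n_test=200, seed=0):
    """Split one (variety, gender) group of authors into training and test sets.

    A seeded shuffle is used here only as a plausible choice.
    """
    if len(authors) < n_train + n_test:
        raise ValueError("not enough authors for the requested split")
    shuffled = list(authors)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]
```

With 500 authors per group, this yields 300 training and 200 test authors with no overlap.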

Evaluation measures
The accuracy is calculated per task and language.
Then, the averages per task are calculated.
Finally, the ranking is the global average.
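The evaluation scheme can be sketched as below. The formulas from the original slide are not available here, so this assumes that each task average is the unweighted mean over languages and that the global score is the unweighted mean over task averages. All function names are hypothetical.

```python
from statistics import mean


def accuracy(gold, predicted):
    """Fraction of authors whose predicted label equals the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def task_average(accuracy_by_language):
    """Average accuracy of one task over all languages (assumed unweighted)."""
    return mean(accuracy_by_language.values())


def global_score(accuracy_by_task):
    """Global average over tasks, used for ranking (assumed unweighted).

    `accuracy_by_task` maps task name -> {language -> accuracy}.
    """
    return mean(task_average(by_lang) for by_lang in accuracy_by_task.values())
```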

Baselines
● BASELINE-stat: A statistical baseline that emulates random
choice.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 1,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocessing: lowercasing, removal of punctuation marks and
numbers, removal of stopwords.
● BASELINE-LDR:
○ Documents represented by the probability distribution of
occurrence of their words in the different classes.
○ Each word is weighted depending on its probability of
belonging to each class.
○ The distribution of weights for a given document should be
closer to the weights of its corresponding class.
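The BASELINE-bow description above can be illustrated with a short sketch. It is not the organisers' implementation: the tiny stopword list, the regular expression and the function names are assumptions chosen to keep the example self-contained.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a full per-language list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, then remove stopwords."""
    text = re.sub(r"[^\w\s]|[\d_]", " ", text.lower())
    return [word for word in text.split() if word not in STOPWORDS]


def build_vocabulary(train_docs, size=1000):
    """Return the `size` most common words in the training documents."""
    counts = Counter()
    for doc in train_docs:
        counts.update(preprocess(doc))
    return [word for word, _ in counts.most_common(size)]


def vectorize(doc, vocabulary):
    """Represent a document by the absolute frequency of each vocabulary word."""
    counts = Counter(preprocess(doc))
    return [counts[word] for word in vocabulary]
```

The resulting vectors can be fed to any standard classifier. The slides do not state which one was used for this baseline.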

22 participants
20 working notes
19 countries (map labels: Qatar, Netherlands, Cuba, Slovenia)

Approaches

Approaches - Preprocessing
● HTML cleaning to obtain plain text: Khan; Martinc et al.; Ribeiro-Oliveira & Ferreira
● Punctuation signs: Ribeiro-Oliveira & Ferreira; Martinc et al.; Schaetti
● Stop words: Kheng et al.; Martinc et al.
● Lowercase: Franco-Salvador et al.; Kheng et al.; Kodiyan et al.; Miura et al.
● Remove short tweets: Kheng et al.
● Twitter specific components (hashtags, urls, mentions and RTs): Franco-Salvador et al.; Adame et al.; Kheng et al.; Kodiyan et al.; Markov et al.; Miura et al.; Ribeiro-Oliveira & Ferreira; Schaetti
● Out-of-alphabet words: Schaetti
● Expand contractions: Adame et al.
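As an illustration of the Twitter-specific preprocessing listed above, the sketch below strips a leading RT marker and replaces URLs, mentions and hashtags with placeholder tokens before lowercasing. The placeholder tokens and regular expressions are assumptions. Individual participants may have removed these components or handled them differently.

```python
import re

RT_RE = re.compile(r"^RT\b:?\s*")       # leading retweet marker
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def normalize_tweet(text):
    """Replace Twitter-specific components with placeholders and lowercase."""
    text = RT_RE.sub("", text)
    text = URL_RE.sub("<url>", text)          # URLs first, so '#' inside a URL is untouched
    text = MENTION_RE.sub("<user>", text)
    text = HASHTAG_RE.sub("<hashtag>", text)
    return text.lower().strip()
```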

Approaches - Features
● Stylistic features (Alrifai et al.; Ribeiro-Oliveira & Ferreira; Martinc et al.; Adame et al.; Markov et al.):
○ Ratios of links
○ Hashtags or user mentions
○ Character flooding
○ Emoticons / laughter expressions
○ Domain names
● Emotional features (Adame et al.; Martinc et al.):
○ Emotions
○ Appraisal
○ Admiration
○ Pos/neg emoticons
○ Sentiment words
○ ...
● Specific lists of words, most discriminant words, ... (Martinc et al.; Kocher & Savoy; Khan)
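Some of the stylistic features named on this slide can be sketched as per-author ratios. The definitions below (fraction of an author's tweets containing at least one link, hashtag, mention or flooded character run) are assumptions for illustration, and participants' exact formulations may differ.

```python
import re


def stylistic_features(tweets):
    """Compute simple per-author stylistic ratios over a list of tweet texts."""
    if not tweets:
        raise ValueError("at least one tweet is required")

    def ratio(pattern):
        # Fraction of tweets in which the pattern occurs at least once.
        return sum(bool(re.search(pattern, t)) for t in tweets) / len(tweets)

    return {
        "link_ratio": ratio(r"https?://\S+"),
        "hashtag_ratio": ratio(r"#\w+"),
        "mention_ratio": ratio(r"@\w+"),
        # Character flooding: the same character repeated three or more times.
        "flooding_ratio": ratio(r"(\w)\1{2,}"),
    }
```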

Approaches - Features
● N-gram models: Martinc et al.; Alrifai et al.; Kheng et al.; Markov et al.; Ribeiro-Oliveira & Ferreira; Ogaltsov & Romanov; Schaetti; Ciobanu et al.
● Bag-of-words: Adame et al.; Tellez et al.
● Tf-idf n-grams: Poulston et al.; Schaetti; Basile et al.
● LSA: Kheng et al.
● Second order representation: Pastor et al.
● Word embeddings: Ignatov et al.; Kodiyan et al.; Sierra et al.; Poulston et al.; Miura et al.
● Character embeddings: Franco-Salvador et al.; Miura et al.
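For readers unfamiliar with n-gram representations, a minimal sketch of character and word n-gram counting follows. The function names and the default n values are illustrative choices, not settings reported by any participant.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping word n-grams, splitting on whitespace."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

Texts shorter than n produce an empty count, since no full n-gram fits.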

Approaches - Methods
● Logistic regression: Ignatov et al.; Martinc et al.; Poulston et al.; Ogaltsov & Romanov
● SVM: Alrifai et al.; Kheng et al.; Pastor et al.; Markov et al.; Tellez et al.; Basile et al.; Ribeiro-Oliveira & Ferreira; Ciobanu et al.
● Naive Bayes: Kheng et al.
● Distance-based approaches: Adame et al.; Kocher & Savoy; Khan
● Recurrent Neural Networks: Kodiyan et al.; Miura et al.
● Convolutional Neural Networks: Schaetti; Sierra et al.; Miura et al.
● Deep Averaging Networks: Franco-Salvador et al.

Gender results

Variety results

Confusion among varieties (AR)

Confusion among varieties (PT)

Confusion among varieties (ES)

Confusion among varieties (EN)

Coarse vs. fine grained English
● American: United States + Canada.
● European: Great Britain + Ireland.
● Oceanic: New Zealand + Australia.
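The coarse-grained grouping above can be expressed as a label mapping, so that a prediction counts as correct when it falls in the same group as the gold variety. The label strings and function names below are assumptions. Only the grouping itself comes from the slide.

```python
# Fine-grained English variety -> coarse-grained group (grouping from the slide).
COARSE_GROUP = {
    "united states": "american",
    "canada": "american",
    "great britain": "european",
    "ireland": "european",
    "new zealand": "oceanic",
    "australia": "oceanic",
}


def fine_accuracy(gold, predicted):
    """Accuracy when the exact variety must match."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def coarse_accuracy(gold, predicted):
    """Accuracy when only the coarse-grained group must match."""
    hits = sum(COARSE_GROUP[g] == COARSE_GROUP[p] for g, p in zip(gold, predicted))
    return hits / len(gold)
```

Confusing Canada with the United States, for example, is an error under the fine-grained evaluation but not under the coarse-grained one.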

The impact of gender on variety identification
● All participants’ predictions are pooled together.
● Except in Spanish, the variety is easier to predict when the
author is female.

The difficulty of gender identification depending on variety
● All participants’ predictions are pooled together.
● For most Arabic and Portuguese varieties, female authors are easier to identify.
● For Spanish and English, both genders are similarly difficult to identify.

Joint evaluation

Final ranking

PAN-AP 2017 best results

Conclusions
● Many feature types were combined (content-based, stylometric, n-grams, …) and, for the first time, deep
learning approaches were widely used.
○ Deep learning approaches did not obtain the best results.
● Per language:
○ The best results were obtained in Portuguese.
○ On average, the worst results in gender identification were obtained in Arabic.
○ On average, the worst results in language variety identification were obtained in English.
● Per variety:
○ In Arabic, Gulf is the most difficult variety and Levantine the easiest.
○ In English, the highest confusion occurs among varieties which share regional locations.
○ In Spanish, most confusions go towards Colombia. The highest confusion is from Peru.
○ Portuguese is asymmetric: the highest confusion is from Portugal to Brazil.
● Coarse vs. fine-grained evaluation in English:
○ The differences are significant, although not very high (3.75%) for the best approaches.
● The impact of gender on language variety identification:
○ In Arabic and Portuguese, the differences between genders are significant.
● The difficulty of gender identification depending on the language variety:
○ For most Arabic and Portuguese varieties, female authors are easier to identify.
○ For Spanish and English, both genders are similarly difficult to identify.

Task impact
              PARTICIPANTS  COUNTRIES  CITATIONS
PAN-AP 2013   21            16         67 (+28)
PAN-AP 2014   10            8          41 (+25)
PAN-AP 2015   22            13         42 (+25)
PAN-AP 2016   22            15         5
PAN-AP 2017   22            19

Next year?

Industry at PAN (Author Profiling)
Organisation Sponsors
Participants

On behalf of the author profiling task organisers:
Thank you very much for participating
and hope to see you next year!!
