5th Author Profiling task at PAN
Gender and Language Variety
Identification in Twitter
PAN-AP-2017 CLEF 2017
Dublin, 11-14 September
Francisco Rangel
Autoritas Consulting &
PRHLT Research Center -
Universitat Politècnica de València
Paolo Rosso
PRHLT Research Center
Universitat Politècnica de València
Martin Potthast & Benno Stein
Bauhaus-Universität Weimar
Introduction
Author profiling aims at identifying
personal traits such as age, gender,
personality, native language, or
language variety from an author's writings.
This is crucial for:
- Marketing
- Security
- Forensics
PAN’16AuthorProfiling
Task goal
To investigate the joint identification of
an author’s gender and language
variety.
Four languages:
English, Spanish, Portuguese, Arabic
Corpus collection
● Step 1: Languages and varieties selection.
● Step 2: Tweets per region retrieval.
Corpus collection
● Step 3: Unique authors identification.
● Step 4: Authors selection:
○ Tweets are not retweets.
○ Tweets are written in the corresponding language.
● Step 5: Language variety annotation:
○ 80% of tweet meta-data coincide with:
■ Geotagging.
■ Toponyms of the region.
● Step 6: Gender annotation:
○ Automatically: dictionary of proper nouns.
○ Manually: visual review.
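Steps 4 and 5 could be sketched roughly as below. This is only an illustration, not the organisers' pipeline: the `Tweet` fields, the helper names, and the reading of the 80% rule as "at least 80% of an author's usable tweets carry location metadata matching the region" are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    lang: str         # language code reported for the tweet (assumed field)
    is_retweet: bool  # assumed field
    location: str     # geotag or free-text place name (assumed field)


def usable_tweets(tweets, lang):
    """Step 4: drop retweets and tweets written in another language."""
    return [t for t in tweets if not t.is_retweet and t.lang == lang]


def matches_region(tweet, toponyms):
    """True if the tweet's location metadata mentions a toponym of the region."""
    place = tweet.location.lower()
    return any(name in place for name in toponyms)


def keep_author(tweets, lang, toponyms, threshold=0.8):
    """Step 5 (assumed reading): keep the author if at least `threshold`
    of their usable tweets match the region."""
    usable = usable_tweets(tweets, lang)
    if not usable:
        return False
    hits = sum(matches_region(t, toponyms) for t in usable)
    return hits / len(usable) >= threshold
```

Here `toponyms` is expected to be a collection of lowercase place names for the target region.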
Corpus
● Step 7: Corpus construction:
○ 500 authors per variety and gender.
■ 300 for training, 200 for test.
○ 100 tweets per author.
Evaluation measures
The accuracy is calculated per task and language.
Then, the accuracies are averaged per task across the four languages.
Finally, the ranking is the global average of the per-task averages.
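The formulas for these averages are not preserved in the slide text. A minimal sketch of the computation, assuming three tasks (gender, variety, and the joint task named on the "Joint evaluation" slide) and unweighted means throughout, might look like this; the function names and language codes are illustrative.

```python
from statistics import mean

LANGUAGES = ("ar", "en", "es", "pt")
TASKS = ("gender", "variety", "joint")  # assumed set of tasks


def task_average(acc_by_lang):
    """Average one task's accuracy over the four languages."""
    return mean(acc_by_lang[lang] for lang in LANGUAGES)


def ranking_score(acc):
    """Global average; `acc` maps task name -> {language code -> accuracy}."""
    return mean(task_average(acc[task]) for task in TASKS)
```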
Baselines
● BASELINE-stat: A statistical baseline that emulates random
choice.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 1,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocessing: lowercasing, removal of punctuation marks and
numbers, removal of stopwords.
● BASELINE-LDR:
○ Documents represented by the probability distribution of
occurrence of their words in the different classes.
○ Each word is weighted depending on its probability of
belonging to each class.
○ The distribution of weights for a given document should be
closer to the weights of its corresponding class.
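A rough sketch of the BASELINE-bow representation described above, using only the Python standard library. The stopword list is a toy placeholder and the function names are hypothetical; the classifier trained on these vectors is not specified on the slide and is left out.

```python
import re
from collections import Counter

# Toy stopword list (assumption); a real run would use a per-language list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def preprocess(text):
    """Lowercase, strip punctuation and numbers, drop stopwords."""
    text = re.sub(r"[^\w\s]|[\d_]", " ", text.lower())
    return [w for w in text.split() if w not in STOPWORDS]


def build_vocabulary(train_docs, size=1000):
    """The `size` most common words in the training set."""
    counts = Counter()
    for doc in train_docs:
        counts.update(preprocess(doc))
    return [word for word, _ in counts.most_common(size)]


def vectorize(doc, vocabulary):
    """Represent a document by absolute word frequencies over the vocabulary."""
    counts = Counter(preprocess(doc))
    return [counts[word] for word in vocabulary]
```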
22 participants
20 working notes
19 countries
Qatar
Netherlands
Cuba
Slovenia
Approaches
Approaches - Preprocessing
● HTML cleaning to obtain plain text: Khan; Martinc et al.; Ribeiro-Oliveira & Ferreira
● Punctuation signs: Ribeiro-Oliveira & Ferreira; Martinc et al.; Schaetti
● Stop words: Kheng et al.; Martinc et al.
● Lowercase: Franco-Salvador et al.; Kheng et al.; Kodiyan et al.; Miura et al.
● Remove short tweets: Kheng et al.
● Twitter-specific components (hashtags, URLs, mentions and RTs): Franco-Salvador et al.; Adame et al.; Kheng et al.; Kodiyan et al.; Markov et al.; Miura et al.; Ribeiro-Oliveira & Ferreira; Schaetti
● Out-of-alphabet words: Schaetti
● Expand contractions: Adame et al.
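Several of the preprocessing steps listed above can be combined in a few lines. The sketch below is not any participant's code: the regular expressions, the choice to delete Twitter-specific items rather than replace them with placeholders, and the `min_tokens` cut-off for short tweets are assumptions.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"^RT\b:?\s*")  # leading retweet marker


def clean_tweet(text, min_tokens=3):
    """Return the cleaned tokens, or None if the tweet is too short."""
    text = RETWEET.sub("", text)
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    text = HASHTAG.sub(" ", text)
    text = re.sub(r"[^\w\s]", " ", text.lower())  # lowercase, drop punctuation
    tokens = text.split()
    return tokens if len(tokens) >= min_tokens else None
```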
Approaches - Features
Stylistic features:
- Ratios of links
- Hashtag or user mentions
- Character flooding
- Emoticons / laughter expressions
- Domain names
Alrifai et al.; Ribeiro-Oliveira & Ferreira; Martinc et al.; Adame et al.; Markov et al.
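Stylistic features of this kind are typically simple counts normalised per author. The sketch below computes per-tweet ratios; the regular expressions (in particular the emoticon and laughter patterns) and the feature names are illustrative assumptions, not those of any cited system.

```python
import re

PATTERNS = {
    "links": re.compile(r"https?://\S+"),
    "hashtags": re.compile(r"#\w+"),
    "mentions": re.compile(r"@\w+"),
    # Character flooding: the same character three or more times in a row.
    "flooding": re.compile(r"(\w)\1{2,}"),
    "emoticons": re.compile(r"[:;=][-']?[)(DPp]"),
    "laughter": re.compile(r"\b(?:(?:ha){2,}|(?:ja){2,}|lo+l)\b", re.IGNORECASE),
}


def stylistic_features(tweets):
    """Average number of matches per tweet for each pattern."""
    n = len(tweets) or 1  # avoid division by zero for an empty author
    return {
        name: sum(len(pattern.findall(t)) for t in tweets) / n
        for name, pattern in PATTERNS.items()
    }
```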
Emotional features:
● Emotions
● Appraisal
● Admiration
● Pos/neg emoticons
● Sentiment words
● ...
Adame et al.; Martinc et al.
Specific lists of words, most
discriminant words, ...
Martinc et al.; Kocher & Savoy; Khan
Approaches - Features
● N-gram models: Martinc et al.; Alrifai et al.; Kheng et al.; Markov et al.; Ribeiro-Oliveira & Ferreira; Ogaltsov & Romanov; Schaetti; Ciobanu et al.
● Bag-of-words: Adame et al.; Tellez et al.
● Tf-idf n-grams: Poulston et al.; Schaetti; Basile et al.
● LSA: Kheng et al.
● Second order representation: Pastor et al.
● Word embeddings: Ignatov et al.; Kodiyan et al.; Sierra et al.; Poulston et al.; Miura et al.
● Character embeddings: Franco-Salvador et al.; Miura et al.
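As an illustration of the tf-idf n-gram representation, here is a small standard-library sketch over character trigrams. The n-gram order and the idf variant (log(N/df) + 1) are assumptions; participants used their own settings and toolkits.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """One sparse {n-gram: weight} dict per document."""
    tfs = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: each n-gram counted once per document it occurs in.
    df = Counter(gram for tf in tfs for gram in tf)
    idf = {gram: math.log(len(docs) / df[gram]) + 1.0 for gram in df}
    return [{gram: count * idf[gram] for gram, count in tf.items()} for tf in tfs]
```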
Approaches - Methods
● Logistic regression: Ignatov et al.; Martinc et al.; Poulston et al.; Ogaltsov & Romanov
● SVM: Alrifai et al.; Kheng et al.; Pastor et al.; Markov et al.; Tellez et al.; Basile et al.; Ribeiro-Oliveira & Ferreira; Ciobanu et al.
● Naive Bayes: Kheng et al.
● Distance-based approaches: Adame et al.; Kocher & Savoy; Khan
● Recurrent Neural Networks: Kodiyan et al.; Miura et al.
● Convolutional Neural Networks: Schaetti; Sierra et al.; Miura et al.
● Deep Averaging Networks: Franco-Salvador et al.
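To make the simplest family in this list concrete, a distance-based classifier can be sketched as a nearest-profile rule with cosine similarity over word counts. This is a generic illustration under assumed names; it does not reproduce the method of any team cited above.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fit_profiles(docs, labels):
    """Aggregate word counts of all training documents per class."""
    profiles = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        profiles[label].update(doc.lower().split())
    return dict(profiles)


def predict(doc, profiles):
    """Assign the class whose profile is most similar to the document."""
    vec = Counter(doc.lower().split())
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))
```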
Gender results
Variety results
Confusion among varieties (AR)
Confusion among varieties (PT)
Confusion among varieties (ES)
Confusion among varieties (EN)
Coarse vs. fine-grained English
● American: United States + Canada.
● European: Great Britain + Ireland.
● Oceanic: New Zealand + Australia.
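The coarse-grained evaluation can be obtained by mapping both gold and predicted fine-grained labels to the three groups above before scoring. A minimal sketch follows; the label strings and function names are assumptions.

```python
# Fine-grained variety -> coarse-grained group, as listed on the slide.
COARSE = {
    "united states": "american",
    "canada": "american",
    "great britain": "european",
    "ireland": "european",
    "new zealand": "oceanic",
    "australia": "oceanic",
}


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold one."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def coarse_accuracy(gold, predicted):
    """Accuracy after collapsing fine-grained labels into coarse groups."""
    return accuracy([COARSE[g] for g in gold], [COARSE[p] for p in predicted])
```

A prediction of "united states" for a Canadian author is an error in the fine-grained setting but counts as correct in the coarse-grained one.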
The impact of gender on variety identification
● All participants’ predictions together.
● Except in Spanish, the variety is easier to predict when the
author is female.
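Pooling all participants' predictions and splitting accuracy by the author's gender could be done as in the sketch below. The record layout (group, gold label, predicted label) is an assumption made for illustration.

```python
from collections import defaultdict


def pooled_accuracy_by_group(records):
    """records: (group, gold, predicted) tuples pooled over all systems.

    Returns the accuracy per group, e.g. per author gender."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, predicted in records:
        total[group] += 1
        correct[group] += int(gold == predicted)
    return {group: correct[group] / total[group] for group in total}
```

The same helper covers the reverse analysis by passing the language variety as the group and the gender labels as gold and predicted values.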
The difficulty of gender identification depending on variety
● All participants’ predictions together.
● For most Arabic and Portuguese varieties, females are easier to identify.
● In Spanish and English, both genders are similarly difficult to identify.
Joint evaluation
Final ranking
PAN-AP 2017 best results
Conclusions
● A wide combination of features has been used: content-based, stylometric, n-grams, … For the first time,
deep learning approaches have been widely used.
○ Deep learning approaches did not obtain the best results.
● Per language:
○ The best results have been obtained in Portuguese.
○ On average, the worst results in gender identification have been obtained in Arabic.
○ On average, the worst results in language variety identification have been obtained in English.
● Per variety:
○ In Arabic, the most difficult variety is Gulf and the easiest is Levantine.
○ In English, the highest confusion occurs among varieties that share a geographical region.
○ In Spanish, most confusions involve Colombia; the highest confusion comes from Peru.
○ Portuguese is asymmetric: the highest confusion is from Portugal to Brazil.
● Coarse vs. fine-grained evaluation in English:
○ The differences are significant, although not very large (3.75%) for the best approaches.
● The impact of gender on language variety identification:
○ In Arabic and Portuguese, the differences between genders are significant.
● The difficulty of gender identification depending on the language variety:
○ For most Arabic and Portuguese varieties, females are easier to identify.
○ In Spanish and English, both genders are similarly difficult to identify.
Task impact
EDITION        PARTICIPANTS   COUNTRIES   CITATIONS
PAN-AP 2013    21             16          67 (+28)
PAN-AP 2014    10             8           41 (+25)
PAN-AP 2015    22             13          42 (+25)
PAN-AP 2016    22             15          5
PAN-AP 2017    22             19
Next year?
Industry at PAN (Author Profiling)
Organisation Sponsors
Participants
On behalf of the author profiling task organisers:
Thank you very much for participating
and hope to see you next year!!

Gender and Language Variety Identification in Twitter. Overview of the 5th. Author Profiling task at PAN@CLEF 2017.
