RusProfiling Gender Identification in Russian Texts PAN@FIRE

These slides give an overview of the RusProfiling shared task at PAN@FIRE 2017 in Bangalore, India.

This year's task aimed at gender identification in Russian texts from a cross-genre perspective: training on Twitter and evaluating on Twitter, Facebook, reviews, essays and gender-imitated texts.

Published in: Data & Analytics

  1. RusProfiling Gender Identification in Russian Texts
     PAN@FIRE 2017, Bangalore, 8-10 December
     - Francisco Rangel (Autoritas Consulting)
     - Paolo Rosso (PRHLT, Universitat Politècnica de València, Spain)
     - Pavel Seredin & Olga Litvinova (RusProfiling Lab & Kurchatov Institute, Russia)
     - Tatiana Litvinova (RusProfiling Lab, Russia)
  2. Introduction
     Author profiling aims at identifying an author's traits, such as age, gender, native language or personality, from their writing. This is crucial for:
     - Marketing
     - Security
     - Forensics
     PAN@FIRE’17 RusProfiling
  3. Task goal
     To predict gender in Russian texts from a cross-genre perspective:
     - Essays
     - Facebook
     - Twitter
     - Reviews
     - Gender-imitated texts
  4. Corpora description
     Training set:
     - Twitter (600 authors)
       - Manually annotated (name, picture…)
       - From 1 to 200 tweets per author
     Test sets:
     - Essays (370 authors)
       - Between 1 and 2 texts per author
       - From the RusPersonality corpus
       - Topics: letter to a friend, picture description, letter to an employee...
       - Average length of 150 words
     - Facebook (228 authors)
       - Different age groups (20+, 30+, 40+)
       - Different cities (minimum mutual friends)
       - Average length of 1,000 words
     - Twitter (400 authors)
       - A random partition ensuring no overlap with the training authors
     - Reviews (776 authors)
       - From the TrustPilot corpus
       - One text per author
       - Average length of 80 words
     - Gender-imitated (94 authors)
       - From the Gender Imitation Corpus
       - Three texts per author: normal style, imitating the other gender, and obfuscating the author's own style
  5. Evaluation measures
     Accuracy is calculated per corpus. The final ranking is obtained from the weighted accuracy, computed as if the corpora were concatenated into a single test set.
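The evaluation on slide 5 can be illustrated with a small sketch. It assumes that "weighted accuracy as if the corpora were concatenated" means pooling correct predictions and instance counts over all corpora; the function names and the toy counts are hypothetical and are not task results.

```python
from typing import Dict, Tuple

# Maps a corpus name to (correct predictions, total instances).
Counts = Dict[str, Tuple[int, int]]


def per_corpus_accuracy(counts: Counts) -> Dict[str, float]:
    """Accuracy computed separately for each corpus."""
    return {name: correct / total for name, (correct, total) in counts.items()}


def weighted_accuracy(counts: Counts) -> float:
    """Accuracy over the pooled corpora, so larger corpora weigh more."""
    total_correct = sum(correct for correct, _ in counts.values())
    total_items = sum(total for _, total in counts.values())
    return total_correct / total_items


# Toy, invented counts used only to show the arithmetic.
toy_counts = {"corpus_a": (8, 10), "corpus_b": (30, 90)}
print(per_corpus_accuracy(toy_counts))
print(weighted_accuracy(toy_counts))
```

With these toy counts the pooled accuracy is 38/100 = 0.38, while the unweighted mean of the two per-corpus accuracies would be about 0.57, which shows how a large corpus dominates the global ranking.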
  6. Baselines
     ● BASELINE-stat: A statistical baseline that emulates random choice.
     ● BASELINE-bow:
       ○ Documents represented as bag-of-words.
       ○ The 1,000 most common words in the training set.
       ○ Weighted by absolute frequency.
       ○ Preprocessing: lowercasing, removal of punctuation signs and numbers, removal of stopwords.
     ● BASELINE-LDR:
       ○ Documents represented by the probability distribution of occurrence of their words in the different classes.
       ○ Each word is weighted depending on its probability of belonging to each class.
       ○ The distribution of weights for a given document should be closer to the weights of its corresponding class.
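A minimal sketch of how the BASELINE-bow representation on slide 6 could be built, using only the Python standard library. The stopword set is a tiny hypothetical placeholder, the function names are invented for illustration, and the classifier applied on top of these vectors is not specified in the slides, so it is left out. BASELINE-LDR is not covered here.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stopword set; a real run would use a full Russian list.
STOPWORDS = {"и", "в", "на", "не", "что"}


def preprocess(text: str) -> List[str]:
    """Lowercase, replace punctuation and digits with spaces, drop stopwords."""
    text = re.sub(r"[^\w\s]|[\d_]", " ", text.lower())
    return [token for token in text.split() if token not in STOPWORDS]


def build_vocabulary(train_docs: List[str], size: int = 1000) -> List[str]:
    """The `size` most common words over all training documents."""
    counts = Counter(token for doc in train_docs for token in preprocess(doc))
    return [word for word, _ in counts.most_common(size)]


def bow_vector(doc: str, vocabulary: List[str]) -> List[int]:
    """Absolute frequency of each vocabulary word in the document."""
    counts = Counter(preprocess(doc))
    return [counts[word] for word in vocabulary]


train_docs = [
    "Я пошла в магазин, купила 2 книги!",
    "Я пошёл на работу и не купил книги.",
]
vocabulary = build_vocabulary(train_docs, size=1000)
print(vocabulary)
print(bow_vector("Книги, книги и я", vocabulary))
```

Words outside the vocabulary are simply ignored at vectorisation time, which is one reason a Twitter-trained vocabulary may transfer poorly to other genres.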
  7. Participants
     - AmritaNLP [18]: V. Vinayan, N. J.R., H. NB, A. Kumar M, and S. K P. AmritaNLP@PAN-RusProfiling: Author profiling using machine learning techniques.
     - BITS_Pilani [1]: R. Bhargava, G. Goel, A. Shah, and Y. Sharma. Gender identification in Russian texts.
     - CIC [7]: I. Markov, H. Gomez-Adorno, G. Sidorov, and A. Gelbukh. The winning approach to cross-genre gender identification in Russian at RusProfiling 2017.
     - DUBL [17]: G. Skitalinskaya, L. Akhtyamova, and J. Cardiff. Cross-genre gender identification in Russian texts using topic modeling. Working note: Team DUBL.
     - RBG [3]: B. Ganesh HB, A. Kumar M, and S. KP. Representation of target classes for text classification - Amrita CEN NLP@RusProfiling PAN 2017.
  8. Participants’ runs per dataset
     - Essays: 18
     - Facebook: 19
     - Twitter: 19
     - Reviews: 19
     - Gender-Imitated: 19
     - Total: 93
  9. Approaches
  10. Approaches - Preprocessing
      - Obtain plain text: BITS_Pilani
      - Remove stopwords: BITS_Pilani, DUBL
      - Remove short words: DUBL
      - Twitter-specific elements (mentions, hashtags, URLs): BITS_Pilani, DUBL
      - Remove punctuation marks: BITS_Pilani, CIC
      - Remove numbers: BITS_Pilani
      - Remove non-Cyrillic characters: CIC
      - Lemmatisation: DUBL
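Several of the preprocessing steps on slide 10 can be sketched with regular expressions. This is an illustrative combination, not any single team's pipeline: the slides do not say whether Twitter-specific elements were removed or normalised (removal is assumed here), and the `min_word_length` threshold and function name are hypothetical. Stopword removal and lemmatisation are omitted because they need external resources.

```python
import re

# Twitter-specific elements: user mentions, hashtags and URLs.
TWITTER_ELEMENTS = re.compile(r"@\w+|#\w+|https?://\S+")
# Anything that is neither a Cyrillic letter nor whitespace
# (this also covers punctuation marks, digits and Latin letters).
NON_CYRILLIC = re.compile(r"[^а-яёА-ЯЁ\s]")


def clean_tweet(text: str, min_word_length: int = 3) -> str:
    """Strip Twitter elements and non-Cyrillic characters, then drop short words."""
    text = TWITTER_ELEMENTS.sub(" ", text)
    text = NON_CYRILLIC.sub(" ", text)
    words = [word for word in text.lower().split() if len(word) >= min_word_length]
    return " ".join(words)


print(clean_tweet("@ivan Привет, мир! Смотри https://example.com #новости 2017 ok я"))
```

Removing Twitter elements before the non-Cyrillic filter matters: otherwise the `@` and `#` markers would be stripped first and the mention or hashtag text would survive as an ordinary word.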
  11. Approaches - Features
      - AmritaNLP:
        - Number of user mentions
        - Hashtags
        - URLs
        - Emoticons
        - Punctuation marks
        - Average word length
        - Tf-idf bag-of-words
      - BITS_Pilani:
        - Linguistic patterns such as word endings or the use of first person singular pronouns within a distance to a verb in past tense
        - (Combined with) deep learning techniques
      - CIC:
        - Word and character n-grams
        - Words most frequently used per gender
        - Linguistic patterns such as word endings or the use of first person singular pronouns within a distance to a verb in past tense
      - DUBL:
        - Topic modelling
      - RBG:
        - A representation scheme based on the texts belonging to the corresponding target classes
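Two of the feature families on slide 11 are easy to sketch: n-gram counts and simple stylistic counts. The function names and the exact counting rules below are assumptions made for illustration; emoticon counting and tf-idf weighting are left out, and the participants' actual feature extractors may differ.

```python
import re
from collections import Counter
from typing import Dict


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int = 2) -> Counter:
    """Counts of overlapping word n-grams over whitespace-separated tokens."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def stylistic_counts(text: str) -> Dict[str, float]:
    """Surface counts in the spirit of the mention/hashtag/URL/punctuation features."""
    words = re.findall(r"\w+", text)
    return {
        "mentions": len(re.findall(r"@\w+", text)),
        "hashtags": len(re.findall(r"#\w+", text)),
        "urls": len(re.findall(r"https?://\S+", text)),
        # Every character that is neither a word character nor whitespace,
        # so the '@' and '#' markers are counted here as well.
        "punctuation": len(re.findall(r"[^\w\s]", text)),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
    }


print(char_ngrams("мама", n=2))
print(word_ngrams("я пошла домой", n=2))
print(stylistic_counts("@anna привет! #утро"))
```

Character n-grams capture word endings implicitly, which may be part of why they pair well with the explicit ending-based rules listed for CIC.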
  12. Approaches - Methods
      - Support Vector Machines: AmritaNLP, CIC, RBG
      - Random Forest: AmritaNLP
      - AdaBoost: AmritaNLP
      - Additive Regularization for Topic Modelling: DUBL
      - Rule-based: BITS_Pilani
      - Long Short-Term Memory networks: BITS_Pilani
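To make the rule-based idea concrete, here is a hedged sketch of one possible rule of the kind described in the slides: a first person singular pronoun followed, within a few words, by a past-tense verb. In Russian, singular past-tense verbs agree in gender with the subject ("я купил" is masculine, "я купила" is feminine). The function name, the `window` parameter and the ending lists are assumptions; this is not the participants' actual rule set, and the suffix test is crude (nouns such as "школа" also end in "ла").

```python
import re
from typing import Optional


def past_tense_gender_cue(text: str, window: int = 3) -> Optional[str]:
    """Vote on gender from verb-like endings in the `window` words after 'я'."""
    tokens = re.findall(r"[а-яё]+", text.lower())
    votes = {"female": 0, "male": 0}
    for i, token in enumerate(tokens):
        if token != "я":
            continue
        for following in tokens[i + 1:i + 1 + window]:
            # Feminine endings are checked first, since "ла" would never match "л".
            if following.endswith(("ла", "лась")):
                votes["female"] += 1
                break
            if following.endswith(("л", "лся")):
                votes["male"] += 1
                break
    if votes["female"] == votes["male"]:
        return None  # no cue found, or the cues cancel out
    return max(votes, key=votes.get)


print(past_tense_gender_cue("Вчера я купила книгу"))
print(past_tense_gender_cue("Я долго не спал"))
print(past_tense_gender_cue("Привет всем"))
```

A rule like this abstains on texts without first-person past-tense constructions, so in practice it would need a fallback classifier, which is consistent with the slides listing rules in combination with other methods.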
  13. Results on Essays
      - Best result: a combination of linguistic rules and deep learning, 10% higher than the second best result.
      - Second best result: stylistic features with traditional machine learning.
      - 7 runs below the bow and majority baselines.
      - The LDR baseline outperforms the two best systems by 3% and 13%, respectively.
  14. Results on Facebook
      - 4 best results: SVMs with combinations of n-grams and linguistic rules; 2 results above 90%.
      - 5th and 6th best results: linguistic rules combined with deep learning.
      - 5 runs below the majority baseline.
      - 12 runs below the bow baseline.
  15. Results on Twitter
      - 2 best results: SVMs with combinations of n-grams and linguistic rules.
      - 3rd best result: linguistic rules combined with deep learning.
      - 4 runs below the majority baseline.
      - The bow baseline is below the majority baseline.
  16. Results on Reviews
      - 2 best results: SVMs with combinations of n-grams and linguistic rules.
      - 3rd and 4th best results: linguistic rules combined with deep learning.
      - 5 runs below the majority baseline.
      - The bow baseline ties the majority baseline.
      - The LDR baseline outperforms the best system by 4%.
  17. Results on Gender Imitation
      - 2 best results: linguistic rules combined with deep learning.
      - 3rd best result: stylistic features with traditional machine learning.
      - 4th to 7th best results: SVMs with combinations of n-grams and linguistic rules.
      - 11 runs below the majority and bow baselines.
      - Most systems improve on the majority baseline by less than 5%.
  18. Global ranking
      - 4 best results: SVMs with combinations of n-grams and linguistic rules.
      - 5th to 7th best results: stylistic features with traditional machine learning.
      - The deep learning approach did not participate in all the datasets.
      - 9 runs below the majority baseline and 10 below the bow baseline.
      - The LDR baseline outperforms the best result by 6.65%.
  19. Conclusions
      ● The task aimed at identifying gender from Russian texts from a cross-genre perspective:
        ○ Essays, Twitter, Facebook, reviews, gender-imitated.
      ● 5 participants submitted 93 runs.
      ● Accuracy was used to evaluate the systems.
      ● Several different features and methods:
        ○ Traditional hand-crafted features such as word and character n-grams and stylometrics, with traditional machine learning methods such as Support Vector Machines.
        ○ Deep learning techniques.
      ● Regarding results:
        ○ Deep learning techniques obtained close to the best results, especially on essays and gender-imitated texts.
        ○ The best results were achieved not on Twitter but on Facebook.
        ○ Results on reviews were among the worst.
  20. On behalf of the RusProfiling task organisers: thank you very much for participating, and we hope to see you next year!
