The document summarizes the results of the PAN-AP-2013 author profiling task, which aimed to identify the age and gender of authors from social media texts in English and Spanish. Twenty-one teams submitted results for English and 20 for Spanish. For English, the best teams reached a joint (age and gender) accuracy a little under 40%, with gender accuracy up to 59% and age accuracy up to 66%. For Spanish, joint accuracy was under 45% for all teams, with gender accuracy up to 65% and age accuracy up to 66% for the top teams. The document also reviews the approaches, features, and methods used by the participating teams.
Author Profiling. PAN@CLEF-2013 Task
1. Author Profiling
PAN-AP-2013 - CLEF 2013
Valencia, 24th September 2013
Francisco Rangel, Autoritas / Universitat Politècnica de València
Paolo Rosso, Universitat Politècnica de València
Moshe Koppel, Bar-Ilan University
Efstathios Stamatatos, University of the Aegean
Giacomo Inches, University of Lugano
4. Task Goals
‣ Given a collection of documents retrieved from social media in English and Spanish...
MAIN GOAL
‣ Identify age and gender
SECONDARY GOALS
‣ Test the robustness of the approaches for identifying the age and gender of predators
‣ Measure the computational time needed to perform the task
5. Related Work
AUTHOR | COLLECTION | FEATURES | RESULTS | OTHER CHARACTERISTICS
Argamon et al., 2002 | British National Corpus | Part-of-speech | Gender: 80% accuracy |
Holmes & Meyerhoff, 2003 | Formal texts | - | Age and gender |
Burger & Henderson, 2006 | Blogs | Post length, capital letters, punctuation, HTML features | They only reported "low percentage errors" | Two age classes: [0,18[, [18,-]
Koppel et al., 2003 | Blogs | Simple lexical and syntactic functions | Gender: 80% accuracy | Self-labeling
Schler et al., 2006 | Blogs | Stylistic features + content words with the highest information gain | Gender: 80% accuracy; Age: 75% accuracy |
Goswami et al., 2009 | Blogs | Slang + sentence length | Gender: 89.18% accuracy; Age: 80.32% accuracy |
Zhang & Zhang, 2010 | Segments of blogs | Words, punctuation, average words/sentence length, POS, word factor analysis | Gender: 72.10% accuracy |
Nguyen et al., 2011 and 2013 | Blogs & Twitter | Unigrams, POS, LIWC | Correlation: 0.74; Mean absolute error: 4.1 - 6.8 years | Manual labeling; age as continuous variable
Peersman et al., 2011 | Netlog | Unigrams, bigrams, trigrams and tetragrams | Gender+Age: 88.8% accuracy | Self-labeling, min 16 plus 16,18,25
6. Data Collection - Social Media
‣ Big Data?
‣ High variety of themes
‣ Sexual conversations vs. sexual predators
‣ Difficulty of obtaining well-labelled data
‣ Real people vs. robots (chatbots)
‣ Multilingual: English + Spanish
7. Data Collection - English Distribution
[Figure: histogram of number of documents (y-axis) by number of words (x-axis)]
Number of words: MIN 0, MAX 22,736, AVG 335, STD 208
8. Data Collection - English Distribution (zoomed)
‣ Zooming in on the distribution reveals a Gaussian-like shape with its maximum at 415 words.
[Figure: zoomed histogram of number of documents (y-axis) by number of words (x-axis); marked values: 335, 415, 495]
9. Data Collection - English Distribution (log-log)
‣ The log-log representation shows that the distribution has a long-tail component in two places: one before the point of maximum frequency and another after it.
‣ This property could be used to select the minimum and maximum number of words that posts must have.
[Figure: log-log plot of number of documents (y-axis) by number of words (x-axis)]
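The idea of bounding post length from the word-count distribution can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, words are split on whitespace, and the minimum and maximum bounds are left to the caller because the slides do not state which values were chosen.

```python
import math
from collections import Counter


def word_count_histogram(documents):
    """Map each word count to the number of documents having that count."""
    return Counter(len(doc.split()) for doc in documents)


def log_log_points(histogram):
    """Return sorted (log10 words, log10 documents) pairs.

    Documents with zero words are skipped because log10(0) is undefined.
    """
    return sorted(
        (math.log10(words), math.log10(count))
        for words, count in histogram.items()
        if words > 0
    )


def filter_by_length(documents, min_words, max_words):
    """Keep only documents whose word count lies within [min_words, max_words]."""
    return [d for d in documents if min_words <= len(d.split()) <= max_words]
```

Plotting the output of `log_log_points` would give a view comparable to the log-log slide, from which candidate bounds for `filter_by_length` could be read off.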
10. Data Collection - Spanish Distribution
[Figure: histogram of number of documents (y-axis) by number of words (x-axis)]
Number of words: MIN 0, MAX 12,246, AVG 176, STD 832
11. Data Collection - Spanish Distribution (zoomed)
[Figure: zoomed histogram of number of documents (y-axis) by number of words (x-axis); marked values: 15, 500]
12. Data Collection - Spanish Distribution (log-log)
[Figure: log-log plot of number of documents (y-axis) by number of words (x-axis)]
13. 13
Data Collection - Selection Criteria
‣ Grouping posts by author
‣ Keeping authors with few posts
‣ Chunking authors with more than 1,000 words
‣ Balanced by gender
‣ Age groups (non-balanced):
  ‣ 10s (13-17)
  ‣ 20s (23-27)
  ‣ 30s (33-47)
‣ Random split into three datasets
  ‣ Training
  ‣ Early Bird (10%)
  ‣ Testing (+20%)
‣ Introduction of a few special cases
  ‣ Predators (0.0012%)
  ‣ Adult-adult sexual conversations
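The grouping and chunking steps above can be sketched as follows. This is a minimal illustration, not the organisers' actual pipeline: the function names, the `(author_id, text)` input shape, and the choice to cut chunks at exactly 1,000 whitespace-separated words are all assumptions.

```python
from collections import defaultdict


def group_posts_by_author(posts):
    """Group (author_id, text) pairs into {author_id: [text, ...]}."""
    grouped = defaultdict(list)
    for author_id, text in posts:
        grouped[author_id].append(text)
    return dict(grouped)


def chunk_words(text, max_words=1000):
    """Split a text into consecutive chunks of at most max_words words.

    Words are whitespace-separated tokens (an assumption; the slides do
    not say how words were counted).
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def chunk_authors(grouped, max_words=1000):
    """Concatenate each author's posts and chunk the result."""
    return {
        author_id: chunk_words(" ".join(texts), max_words)
        for author_id, texts in grouped.items()
    }
```

For example, an author whose posts add up to 2,500 words would yield three chunks of 1,000, 1,000 and 500 words.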
14. 14
Data Collection - Statistics
NUMBER OF AUTHORS
LANG  AGE  GENDER  TRAINING     EARLY BIRDS  TEST
EN    10s  MALE    8,600        740          888
EN    10s  FEMALE  8,600        740          888
EN    20s  MALE    (72) 42,828  3,840        (32) 4,576
EN    20s  FEMALE  (25) 42,875  3,840        (10) 4,576
EN    30s  MALE    (92) 66,708  6,020        (40) 7,184
EN    30s  FEMALE  66,800       6,020        7,224
EN    TOTAL        236,600      21,200       25,440
ES    10s  MALE    1,250        120          144
ES    10s  FEMALE  1,250        120          144
ES    20s  MALE    21,300       1,920        2,304
ES    20s  FEMALE  21,300       1,920        2,304
ES    30s  MALE    15,400       1,360        1,632
ES    30s  FEMALE  15,400       1,360        1,632
ES    TOTAL        75,900       6,800        8,160
Values in parentheses: special cases (predators, adult-adult sexual conversations)
15. 15
Performance measures for identification
ENGLISH: accuracy for gender, accuracy for age -> joint accuracy
SPANISH: accuracy for gender, accuracy for age -> joint accuracy
Average accuracy -> WINNER OF THE TASK
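The measures on this slide can be sketched in a few lines. This is an illustrative sketch: the function names and the `(gender, age_group)` tuple layout are hypothetical, and the reading that the final score is the mean of the English and Spanish joint accuracies is an assumption drawn from the slide layout.

```python
def task_accuracies(gold, predicted):
    """Return gender, age and joint accuracy for one language.

    gold and predicted are equal-length lists of (gender, age_group)
    tuples, one per author. Joint accuracy counts an author as correct
    only when both gender and age group match.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    n = len(gold)
    pairs = list(zip(gold, predicted))
    return {
        "gender": sum(g[0] == p[0] for g, p in pairs) / n,
        "age": sum(g[1] == p[1] for g, p in pairs) / n,
        "joint": sum(g == p for g, p in pairs) / n,
    }


def average_joint_accuracy(english, spanish):
    """Mean of the two per-language joint accuracies (assumed ranking score)."""
    return (english["joint"] + spanish["joint"]) / 2
```

Note that joint accuracy can never exceed the lower of the gender and age accuracies, which is one reason the joint figures are so much lower than the individual ones.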
16. 16
Other performance measures
‣ Number of correctly identified gender and age for predators
‣ Number of correctly identified gender and age for sexual conversations between adults
‣ Total time needed to process the test data
17. 17
Participants
‣ 66 registered teams
‣ 21 participants (32%)
‣ 16 countries
‣ 18 papers (86%)
‣ 8 long papers
‣ 10 short papers
19. 19
Approaches - Preprocessing
HTML cleaning to obtain plain text
5 teams: [gopal-patra][moreau][meina][weren][pavan]
Deletion of documents with at least 0.1% of spam words
1 team: [flekova]
Principal Component Analysis to reduce dimensionality
1 team: [yong-lim]
Subset selection during training to reduce dimensionality
5 teams: [caurcel-diaz][flekova][moreau][hernandez-farias][sapkota]
Discrimination between human-like posts and spam-like posts (chatbots)
1 team: [meina]
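As an illustration of the HTML cleaning step, the sketch below strips tags with Python's standard `html.parser` module. It is not any participant's actual code; the class and function names are hypothetical, and skipping `script` and `style` content is an added assumption.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the content of script and style tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Concatenate text nodes and collapse runs of whitespace.
        return " ".join("".join(self._parts).split())


def html_to_text(html):
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because text nodes are concatenated without a separator, adjacent block elements with no whitespace between them are merged into one token; a production cleaner would need to handle that case.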
20. 20
Approaches - Features
Stylistic features: frequencies of punctuation marks, capital letters, quotations...
9 teams: [yong-lim][cruz][pavan][gopal-patra][de-arteaga][meina][flekova][aleman][santosh]
+ POS tags
5 teams: [yong-lim][meina][aleman][cruz][santosh]
HTML-based features like image urls or links
3 teams: [santosh][sapkota][meina]
Readability
7 teams: [gopal-patra][yong-lim][meina][flekova][aleman][weren][gillam]
Emoticons
2 teams: [aleman][hernandez-farias]
*[sapkota] explicitly discarded them
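A minimal sketch of the kind of stylistic features listed above: relative frequencies of punctuation marks, capital letters and quotation marks. The feature names and the per-character normalisation are assumptions for illustration; the slides do not describe how each team computed them.

```python
import string


def stylistic_features(text):
    """Return per-character relative frequencies of simple style markers."""
    total = len(text)
    if total == 0:
        return {"punctuation": 0.0, "uppercase": 0.0, "quotes": 0.0}
    counts = {
        # ASCII punctuation only; other scripts would need a wider set.
        "punctuation": sum(ch in string.punctuation for ch in text),
        "uppercase": sum(ch.isupper() for ch in text),
        "quotes": sum(ch in "\"'" for ch in text),
    }
    return {name: count / total for name, count in counts.items()}
```

For the 10-character string `Hi, "YOU"!` this gives 0.4 for punctuation, 0.4 for uppercase and 0.2 for quotes.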
21. 21
Approaches - Features
Content features: LSA, BoW, TF-IDF, dictionary-based words, topic-based words, entropy-based words...
11 teams: [sapkota][gopal-patra][yong-lim][seifeddine][caurcel-diaz][flekova][meina][cruz][santosh][pavan][hernandez-farias]
Named entities
1 team: [flekova]
Sentiment words
1 team: [gopal-patra]
Emotion words
1 team: [meina]
Slang, contractions and words with character flooding
4 teams: [flekova][caurcel-diaz][aleman][hernandez-farias]
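To make the bag-of-words and TF-IDF content features concrete, here is a small pure-Python sketch. The weighting shown (term frequency times log(N/df), lowercased whitespace tokens) is one common variant chosen for illustration; the slides do not specify which variant each team used.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / doc frequency),
    where N is the number of documents.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

With this weighting, a term that appears in every document gets weight 0, so only terms that discriminate between documents contribute.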
22. 22
Approaches - Features
Text to be identified is used as a query for a search engine
1 team: [weren]
Unsupervised features based on statistics
1 team: [de-arteaga]
Language models (n-grams)
4 teams: [meina][jankowska][moreau][sapkota]
Collocations
1 team: [meina]
Second order representation based on relationships between documents and profiles
1 team: [pastor]
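As a rough illustration of n-gram features, the sketch below counts character n-grams in a text. Using character (rather than word) n-grams and n = 3 as the default are assumptions; the slides only say that four teams used n-gram language models.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count every overlapping character n-gram in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_profile(text, n=3):
    """Return relative n-gram frequencies, i.e. a simple text profile."""
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}
```

Profiles built this way for each age or gender class could then be compared with the profile of an unseen text, which is one common way such models are used.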
24. 24
Early birds results
‣ 5 teams participated, 1 team had technical problems
‣ Figures for gender are very close to baseline
‣ Main goal -> All participants improved in the final evaluation, mainly Aleman
25. 25
Final results
‣ 21 teams for English, 20 teams for Spanish
‣ Values similar to Early Birds
‣ Gender close to baseline
‣ Joint identification more difficult
26. 26
Results for English - Joint Identification
[Bar chart: joint identification accuracy per team, x-axis from 0 to 1. Teams in chart order: Meina, Pastor L., Seifeddine, Santosh, Yong Lim, Ladra, Aleman, Gillam, Kern, Cruz, Pavan, Caurcel Diaz, H. Farias, Jankowska, Flekova, Weren, Sapkota, De-Arteaga, Moreau, BASELINE, Gopal Patra, Cagnina]
All results below 40%
BASELINE: 16%
2 teams below baseline
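The baselines quoted on the results slides (16% joint, 50% gender, 33% age) are consistent with uniform random guessing over two genders and three age groups. That reading is an assumption; the slides do not state how the baseline was computed.

```latex
P(\text{gender correct}) = \tfrac{1}{2} = 0.50, \qquad
P(\text{age correct}) = \tfrac{1}{3} \approx 0.33, \qquad
P(\text{both correct}) = \tfrac{1}{2} \cdot \tfrac{1}{3} = \tfrac{1}{6} \approx 0.167
```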
27. 27
Results for English - Gender Identification
[Bar chart: gender identification accuracy per team, x-axis from 0 to 1. Teams in chart order: Meina, Pastor L., Seifeddine, Santosh, Yong Lim, Ladra, Aleman, Gillam, Kern, Cruz, Pavan, Caurcel Diaz, H. Farias, Jankowska, Flekova, Weren, Sapkota, De-Arteaga, Moreau, BASELINE, Gopal Patra, Cagnina. Annotated values: 0.4781, 0.5921]
All results below 60%
4 teams less than 1% better than baseline
BASELINE: 50%
3 teams below baseline
28. 28
Results for English - Age Identification
[Bar chart: age identification accuracy per team, x-axis from 0 to 1. Teams in chart order: Meina, Pastor L., Seifeddine, Santosh, Yong Lim, Ladra, Aleman, Gillam, Kern, Cruz, Pavan, Caurcel Diaz, H. Farias, Jankowska, Flekova, Weren, Sapkota, De-Arteaga, Moreau, BASELINE, Gopal Patra, Cagnina. Annotated values: 0.1234, 0.6572]
All results below 70%
7 teams over 60%
9 teams between 50-60%
3 teams below 50%
BASELINE: 33%
2 teams below baseline
29. 29
Results for English - Joint identification, Gender, Age
[Bar chart: joint identification, gender and age accuracy per team, x-axis from 0 to 1. Teams in chart order: Meina, Pastor L., Seifeddine, Santosh, Yong Lim, Ladra, Aleman, Gillam, Kern, Cruz, Pavan, Caurcel Diaz, H. Farias, Jankowska, Flekova, Weren, Sapkota, De-Arteaga, Moreau, BASELINE, Gopal Patra, Cagnina. Annotated values: 0.0741, 0.1234, 0.4781, 0.6572, 0.5921, 0.3894]
The best teams performed better in both identifications
30. 30
Results for Spanish - Joint Identification
[Bar chart: joint identification accuracy per team, x-axis from 0 to 1. Teams in chart order: Santosh, Pastor L., Cruz, Flekova, Ladra, De-Arteaga, Kern, Yong Lim, Sapkota, Pavan, Jankowska, Meina, Gillam, Moreau, Weren, Cagnina, Caurcel Diaz, H. Farias, BASELINE, Aleman, Seifeddine, Gopal Patra]
All results below 45%
BASELINE: 16%
2 teams below baseline
31. 31
Results for Spanish - Gender Identification
[Bar chart: gender identification accuracy per team, x-axis from 0 to 1. Teams in chart order: Santosh, Pastor L., Cruz, Flekova, Ladra, De-Arteaga, Kern, Yong Lim, Sapkota, Pavan, Jankowska, Meina, Gillam, Moreau, Weren, Cagnina, Caurcel Diaz, H. Farias, BASELINE, Aleman, Seifeddine, Gopal Patra. Annotated values: 0.4784, 0.6473]
All results below 65%
2 teams equal to baseline
BASELINE: 50%
3 teams below baseline
32. 32
Results for Spanish - Age Identification
[Bar chart: age identification accuracy per team, x-axis from 0 to 1. Teams in chart order: Santosh, Pastor L., Cruz, Flekova, Ladra, De-Arteaga, Kern, Yong Lim, Sapkota, Pavan, Jankowska, Meina, Gillam, Moreau, Weren, Cagnina, Caurcel Diaz, H. Farias, BASELINE, Aleman, Seifeddine, Gopal Patra. Annotated values: 0.0512, 0.6558]
All results below 70%
3 teams over 60%
9 teams between 50-60%
6 teams below 50%
BASELINE: 33%
2 teams below baseline
33. 33
Results for Spanish - Total, Gender, Age
[Bar chart: total, gender and age accuracy per team, x-axis from 0 to 1. Teams in chart order: Santosh, Pastor L., Cruz, Flekova, Ladra, De-Arteaga, Kern, Yong Lim, Sapkota, Pavan, Jankowska, Meina, Gillam, Moreau, Weren, Cagnina, Caurcel Diaz, H. Farias, BASELINE, Aleman, Seifeddine, Gopal Patra. Annotated values: 0.0287, 0.0512, 0.4784, 0.6558, 0.6473, 0.4208]
The best teams performed better in both identifications
34. 34
Results per language
[Bar chart: English vs. Spanish accuracy per team, x-axis from 0 to 1. Teams in chart order: Pastor, Santosh, Cruz, Ladra, Yong Lim, Flekova, Meina, Kern, Gillam, Pavan, De-Arteaga, Jankowska, Sapkota, Weren, Moreau, Aleman, Caurcel Diaz, H. Farias, Seifeddine, Cagnina, Gopal Patra]
For 10 teams, English is better
For 10 teams, Spanish is better
1 team only participated in English
37. 37
Conclusions
‣ Very difficult task, mainly for gender identification
‣ Difficult to identify age and gender together
For predators...
‣ Robust at identifying age
‣ Better at identifying gender
‣ Expensive and time-consuming -> Big Data problem?
Also...
‣ We received many different and enriching approaches
‣ We were one of the tasks with the highest number of participants at CLEF (21)
‣ Interest from many teams (66 registered), but the task was new and many (5) did not make it (potentially more participation next year!)
38. Francisco Rangel
Autoritas Consulting /
Universitat Politècnica de València
Paolo Rosso
Universitat Politècnica de València
Moshe Koppel
Bar-Ilan University
Efstathios Stamatatos
University of the Aegean
Giacomo Inches
University of Lugano
On behalf of the AP task organisers:
Thank you very much for participating!
We hope to see you again next year!