Gender Detection in Blogs
(Project Number - 17)
A Project Report
Submitted by
Group Number - 37
Subba Reddy 201406632
Rashmi Sharma 201405581
Abhijeet Thakur 201264203
Guided by
Dr. Vasudev Verma
Mentored by
Vishrut Mehta
For the course
Information Retrieval and Extraction
IIIT, Hyderabad
April, 2015
1. Abstract
The question addressed in this paper is: given a short text document, can we identify
whether the author is a man or a woman? This question is motivated by recent events in
which people faked their gender on the Internet. Note that this differs from the
authorship attribution problem.
Three machine learning algorithms (support vector machine, Bayesian logistic regression
and AdaBoost decision tree) are designed for gender identification based on 545
psycho-linguistic and gender-preferential cues along with stylometric features.
Of these three, the support vector machine gives the highest accuracy, 85.1%, in
gender identification.
2. Project Scope
The goal of this project is, given a blog, to analyze the specific features in
the text that indicate whether it was written by a male or a female.
The features can be anything. For example, if a blog is about dresses or cats, then it
may have been written by a female, and if a blog is about sports, suits, etc., then it
may have been written by a male. This project, however, should also analyze the salient
features that differentiate the text content itself, and not rely merely on the topic of
the text.
3. Related Systems
• Authorship identification: Authorship is determined by checking whether one piece
of text contains significantly longer words than another. Histograms of the
word-length distribution were also used for the same purpose.
• Gender Guesser: This tool attempts to determine an author’s gender based on
the words used. Submitted text is evaluated based on two types of writing: formal
and informal. Formal writing includes fiction and non-fiction stories, articles, and
news reports. Informal writing includes blog and chat-room text.
• Author gender identification from text: Researchers presented
a group of lexical, syntactic and pragmatic features, which would distinguish the
language style of women, namely, the use of specialized vocabulary, expletives, tag.
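The word-length histogram mentioned under authorship identification above can be sketched as follows. This is a minimal illustration, not the method of any of the systems listed; the function names and the simple regular-expression tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def word_length_histogram(text):
    """Return a Counter mapping word length -> number of words of that length.

    Tokenization is an assumption of this sketch: a word is any run of
    ASCII letters or apostrophes.
    """
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(len(w) for w in words)


def mean_word_length(text):
    """Average word length derived from the histogram (0.0 for empty text)."""
    hist = word_length_histogram(text)
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return sum(length * count for length, count in hist.items()) / total
```

Two texts could then be compared by their histograms or by their mean word lengths, for instance `mean_word_length(text_a) > mean_word_length(text_b)`.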
4. Proposed System / Approach
• Collecting a suitable corpus of text messages to be the dataset.
• Identifying features that are significant indicators of gender.
• Extracting feature values from each message automatically.
• Building a classification model to identify the author’s gender of a candidate text
message.
Figure 4.1: Gender Identification Process
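The feature-extraction step of the process above might look roughly like the sketch below. It computes a few structural and function-word features of the kind named in this report. The feature names, the tokenizer and the very short function-word list are hypothetical choices made for illustration, not the 545-feature set referred to in the abstract.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption of this
# sketch); a real system would use a much larger inventory.
FUNCTION_WORDS = ("the", "a", "an", "of", "and", "i", "you", "she", "he", "with")


def extract_features(text):
    """Map a text to a dict of simple stylometric feature values."""
    words = re.findall(r"[a-z']+", text.lower())
    # Sentences are approximated by splitting on ., ! and ? (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1          # avoid division by zero on empty text
    n_sentences = len(sentences) or 1

    features = {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sentence_len": len(words) / n_sentences,
        "type_token_ratio": len(set(words)) / n_words,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    return features
```

The resulting dictionaries can be turned into fixed-order vectors and passed to whichever classifier is used in the final step.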
5. Dataset
We will be using the dataset from the proceedings of PAN 2013 and 2014. The 2013
dataset comprises blog posts, while the 2014 dataset also includes tweets. The original
use of this dataset was for the problem of author profiling, more specifically determining
the author’s age and gender.
Dataset link: http://pan.webis.de/
6. Evaluation and Analysis
• Training Phase: The classifier was trained with four different numbers of blogs:
50, 100, 200 and 500.
• Testing Phase: In each case, 70% of the blogs were used for training and 30% for
testing.
Corpus    Training    Testing    Accuracy
100       70          30         70.37%
200       140         60         70%
260       184         76         68.94%
500       350         150        69.76%
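The 70/30 split and the accuracy figure can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the fixed random seed and the shuffling step are choices of this example and are not taken from the project's code.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle items reproducibly and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seed is an assumption of this sketch
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For a corpus of 100 blogs this split yields 70 training and 30 testing items, as in the first row of the table. Accuracy is then the share of test blogs whose predicted gender equals the labeled gender.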
7. Conclusion and Future Work
By designing appropriate psycho-linguistic and gender-linked features, we observe
that word-based features, function words and structural features play important roles in
gender identification. Experimental results indicate that the identification performance
is improved by increasing the number of text documents in the training dataset as well
as the number of words in each document (e-mail). We find that there are significant
differences between men and women in personal writings such as e-mails, and gender
differences also exist between authors of news articles even though neutral language is
dominant there.