Gender Detection in Blogs
(Project Number - 17)
A Project Report
Submitted by
Group Number - 37
Subba Reddy 201406632
Rashmi Sharma 201405581
Abhijeet Thakur 201264203
Guided by
Dr. Vasudev Verma
Mentored by
Vishrut Mehta
For the course
Information Retrieval and Extraction
IIIT, Hyderabad
April, 2015
1. Abstract
The question addressed in this paper is: given a short text document, can we identify
whether the author is a man or a woman? This question is motivated by recent events in which
people faked their gender on the Internet. Note that this differs from the authorship
attribution problem.
Three machine learning algorithms (support vector machine, Bayesian logistic regression
and AdaBoost decision tree) are applied to gender identification based on 545
psycho-linguistic and gender-preferential cues along with stylometric features.
Of these three, the support vector machine gives the highest accuracy, 85.1%, in
gender identification.
2. Project Scope
The goal of this project is, given a blog, to analyze the specific features in
the text that indicate whether it was written by a male or a female.
The features can be anything. For example, a blog about dresses or cats may have been
written by a female, and a blog about sports, suits, etc. may have been written by a
male. However, this project should also analyze the salient features that differentiate
the text content itself, and not merely the topic of the text.
3. Related Systems
• Authorship identification : Authorship is inferred by determining whether one piece
of text contains significantly longer words than another. Histograms of the word-length
distribution have also been used for the same purpose.
• Gender Guesser : This tool attempts to determine an author’s gender based on
the words used. Submitted text is evaluated based on two types of writing: formal
and informal. Formal writing includes fiction and non-fiction stories, articles, and
news reports. Informal writing includes blog and chat-room text.
• Author gender identification from text : Researchers presented a group of lexical,
syntactic and pragmatic features that distinguish the language style of women,
namely the use of specialized vocabulary, expletives, tag.
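The word-length comparison mentioned under authorship identification can be sketched as follows. This is only an illustration: the function names and the tokenization rule (runs of letters and apostrophes) are assumptions, not something taken from the systems listed above.

```python
import re
from collections import Counter

# Assumed tokenization: a "word" is a run of letters or apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z']+")

def word_length_histogram(text):
    """Return the relative frequency of each word length found in `text`."""
    words = WORD_PATTERN.findall(text)
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in sorted(counts.items())}

def mean_word_length(text):
    """Return the average word length, or 0.0 for text with no words."""
    words = WORD_PATTERN.findall(text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)
```

Two texts can then be compared by their mean word length or by the shape of their histograms.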
4. Proposed System / Approach
• Collecting a suitable corpus of text messages to be the dataset.
• Identifying features that are significant indicators of gender.
• Extracting feature values from each message automatically.
• Building a classification model to identify the author’s gender of a candidate text
message.
Figure 4.1: Gender Identification Process
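The feature extraction step listed above could look roughly like the sketch below. The function-word list, the feature names and the choice of features are hypothetical and far smaller than the 545 cues mentioned in the abstract; they only illustrate how style-based (rather than topic-based) values can be computed automatically from each message.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch); a real system would use a much larger one.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "i", "you", "we", "it", "is")

def extract_features(text):
    """Map one text to a dictionary of simple stylometric feature values."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1          # avoid division by zero on empty text
    n_sentences = len(sentences) or 1
    features = {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sentence_len": len(words) / n_sentences,
        "exclamations_per_word": text.count("!") / n_words,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    return features
```

The resulting dictionaries can be turned into fixed-length vectors (one position per feature name) and passed to whichever classifier is used in the final step.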
5. Dataset
We use the dataset from the proceedings of PAN 2013 and 2014. The 2013
dataset comprises blog posts, while the 2014 dataset also includes tweets. The dataset
was originally used for the problem of Author Profiling, more specifically determining
the author’s age and gender.
Dataset link: http://pan.webis.de/
6. Evaluation and Analysis
• Training Phase : The classifier was trained with 4 different numbers of blogs :
50, 100, 200 and 500.
• Testing Phase : In each case, 70% of the blogs were used for training and 30% for
testing.
Corpus   Training   Testing   Accuracy
100      70         30        70.37%
200      140        60        70%
260      184        76        68.94%
500      350        150       69.76%
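The 70/30 evaluation protocol described above can be sketched as follows. The report does not say how the split was drawn, so shuffling with a fixed seed is an assumption, and the function names are illustrative only.

```python
import random

def split_70_30(items, train_fraction=0.7, seed=0):
    """Shuffle `items` and split them into (training, testing) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed: reproducible split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Return the percentage of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

For a corpus of 100 blogs this yields 70 training and 30 testing documents, and `accuracy` is then computed on the testing portion only.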
7. Conclusion and Future Work
By designing appropriate psycho-linguistic and gender-linked features, we observe
that word-based features, function words and structural features play important roles in
gender identification. Experimental results indicate that identification performance
improves as the number of text documents in the training dataset increases, and as
the number of words in each document (e-mail) increases. We find that there are significant
differences between men and women in personal writings such as e-mails, and that gender
differences also exist between authors of news articles, even though neutral language is
dominant there.