The internet has evolved over the years, changing the way people communicate. The introduction of instant messaging, forums, social networking and blogs has made it possible for people of every age to become authors. However, this growth also invites various kinds of misuse: online communities are vulnerable to deceptive attacks and the spread of false information. Hence, author attribution becomes an important task for distinguishing an original writer from an imposter. This project tries to answer a simple question: "Given a text document, can we identify the author's gender, i.e. male or female?"
4. ABSTRACT
● Through the sands of time, textual content has remained a prominent feature of internet media, especially blogs.
● Thus, author profiling and attribution become an important task, and we try to capture one aspect of it, i.e. gender.
● The internet cannot take responsibility for all of its content; that responsibility lies with the author.
● But . . .
● a lot of content brings a lot of responsibility
5. The Question
Given a text blog, can we identify whether the writer is male or female?
8. THE APPROACH
● We take advantage of the linguistic features of the blog and create a feature file.
● This feature file is then used to train various classifiers, and a model is prepared for each classifier.
● An ensemble is applied on these models, and the input document is classified as written by a male or a female.
10. The Dataset
● Koppel's blog dataset
● contains about 19 thousand documents
● each document contains the text of about ~35 blog posts in XML format
[Dataset Link : http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm ]
11. PARSING
● Language used : Python
● Each blog entry is stored in XML format
<Blog>
<date>....... </date>
<post>
….
</post>
...
</Blog>
● Each blog filename contains the name and gender of the author
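The parsing step above can be sketched as follows. The filename pattern and the exact tag layout are assumptions based on the slide (Koppel's corpus files are often not strictly well-formed XML, so a forgiving regex is used here instead of an XML parser):

```python
import os
import re

def parse_blog_file(path):
    """Parse one corpus file into (gender, list of post texts).

    Assumes a filename like '1000331.female.37.indUnk.Leo.xml'
    and <date>/<post> entries inside a <Blog> element.
    """
    gender = os.path.basename(path).split(".")[1]   # 'male' or 'female'
    with open(path, encoding="utf-8", errors="ignore") as f:
        raw = f.read()
    # re.S lets '.' match newlines, since posts span multiple lines
    posts = [p.strip() for p in re.findall(r"<post>(.*?)</post>", raw, re.S)]
    return gender, posts
```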
13. FEATURES
For our task of Gender Identification, we use the following linguistic features:
● Character-Based Features
● Word-Based Features
● Syntactic Features
● Structural Features
● Function Words
● POS Start Probability
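A toy version of a few of these feature families is sketched below. The feature names and the function-word list are illustrative stand-ins, not the project's actual feature file:

```python
import re
from collections import Counter

# illustrative subset; real function-word lists run to hundreds of entries
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def extract_features(text):
    """Compute simplified character-based, word-based, structural,
    punctuation, and function-word features for one document."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    n_chars = len(text)
    feats = {
        "char_count": n_chars,                                        # character-based
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),    # word-based
        "sentence_count": max(len(re.split(r"[.!?]+", text.strip())) - 1, 1),  # structural
        "exclaim_ratio": text.count("!") / max(n_chars, 1),           # punctuation
    }
    for fw in FUNCTION_WORDS:                                         # function words
        feats["fw_" + fw] = counts[fw] / max(len(words), 1)
    return feats
```

In the real pipeline each blog would yield one such feature vector, written to the feature file row by row.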
15. THE CLASSIFICATION TASK
For the task of classification, we used several classification algorithms and arrived at a model that uses an ensemble of the following:
● Random Forest Classifier
● Neural Networks Classifier
● AdaBoost Tree Classifier
● Gradient Boosting Classifier
● Bagging Classifier
16. THE CLASSIFICATION TASK
For each of the classifiers:
● We fed it partial feature sets to observe how accuracy varies with the features used.
● We applied 10-fold cross-validation to measure accuracy.
To measure the accuracy of the ensemble, we took the majority class from the classifiers' outputs.
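The 10-fold validation protocol can be sketched in plain Python. This is a stand-in for the scikit-learn pipeline; `train_fn` abstracts over whichever classifier is being evaluated:

```python
import random

def k_fold_accuracy(X, y, train_fn, k=10, seed=0):
    """Plain k-fold cross-validation.

    train_fn(X_train, y_train) must return a predict(x) callable.
    Returns the mean accuracy over the k held-out folds.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]       # k roughly equal folds
    accs = []
    for fold in folds:
        held_out = set(fold)
        X_tr = [X[i] for i in idx if i not in held_out]
        y_tr = [y[i] for i in idx if i not in held_out]
        predict = train_fn(X_tr, y_tr)
        correct = sum(predict(X[i]) == y[i] for i in fold)
        accs.append(correct / len(fold))
    return sum(accs) / k
```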
17. RANDOM FOREST CLASSIFIER
● A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset.
● Using the Random Forest Classifier we were able to achieve an accuracy of 69.79%.
18. NEURAL NETWORKS CLASSIFIER
● Consists of multiple layers of nodes, with each layer fully connected to the next; each node is a neuron with a non-linear activation (a multilayer perceptron).
● Uses a supervised learning algorithm called backpropagation for training the network.
● Using the Neural Networks Classifier we were able to achieve an accuracy of 69.51%.
19. ADABOOST TREE CLASSIFIER
● A meta-estimator that begins by fitting a classifier on the original dataset and then fits further rounds of classifiers on the same dataset, re-weighted so that each round focuses on the previously misclassified instances.
● Using the AdaBoost Tree Classifier we were able to achieve an accuracy of 69.57%.
20. GRADIENT BOOSTING CLASSIFIER
● Builds the model in a forward stage-wise fashion.
● At each stage, weak classifiers are introduced to compensate for the shortcomings of the existing weak learners; these shortcomings are identified by the gradients of the loss function.
● Using the Gradient Boosting Classifier we were able to achieve an accuracy of 70.81%.
21. BAGGING CLASSIFIER
● A meta-estimator that fits base classifiers, each on a random subset of the dataset, and then aggregates their individual predictions.
● Using the Bagging Classifier we were able to achieve an accuracy of 70.03%.
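The bootstrap-and-aggregate idea can be illustrated with a toy sketch. The base classifier here is a 1-nearest-neighbour rule on scalar features, an illustrative stand-in for the decision trees used in practice:

```python
import random
from collections import Counter

def bagging_predict(X, y, x_new, n_estimators=15, seed=0):
    """Toy bagging on scalar features X with labels y.

    Each base classifier is 1-NN trained on a bootstrap sample;
    the final label is the majority vote over the estimators.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        sample = [rng.randrange(len(X)) for _ in X]          # bootstrap indices
        nearest = min(sample, key=lambda i: abs(X[i] - x_new))  # 1-NN on the sample
        votes.append(y[nearest])
    return Counter(votes).most_common(1)[0][0]               # aggregate by vote
```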
22. THE ENSEMBLE
● An ensemble takes the outputs of the other classifiers and applies majority voting to those outputs to determine the final label.
● Using the ensemble model on the classifiers discussed above we were able to achieve an accuracy of 71.10%.
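The majority vote over the five classifiers reduces to a few lines. Tie-breaking by first-seen label is an assumption of this sketch; with five voters and two classes, no tie can occur:

```python
from collections import Counter

def ensemble_predict(predictions):
    """Majority vote over the per-classifier labels for one document,
    e.g. ['male', 'female', 'male', 'male', 'female'] -> 'male'."""
    return Counter(predictions).most_common(1)[0][0]
```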
24. THE FINAL RESULTS
● By using the ensemble, we were able to increase our accuracy by nearly 1% in each case, irrespective of the performance of the individual classifiers.
● The maximum accuracy observed during the experiments was 73.19%, achieved by the ensemble model.