4. ABSTRACT
● Through the sands of time, textual content has remained a
prominent feature of internet media especially BLOGS.
● Thus, author profiling and attribution becomes an important
and task and we try to capture one aspect of it, i.e gender.
● internet can’t take responsibility of the all the content, it
should be the author itself.
● But . . .
● lot of content brings a lot of responsibility
5. Given a text blog , can we identify whether
the writer is a male or a female ?
The Question
8. THE APPROACH
● An ensemble is applied on these models and the input
document is classified as written by male or female.
● We take advantage of the linguistic features of the
blog and create a feature file.
● This feature file is then trained on various classifier and a
model for each of the classifier is prepared.
10. ● each document contains text of about ~35 blogs
in XML format.
[Dataset Link : http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm ]
The Dataset
● Koppels blog dataset
● contains about 19 thousand document
11. PARSING
● Language used : Python
● Each blog is entry stored in XML format
<Blog>
<date>....... </date>
<post>
….
</post>
...
<Blog>
● Each of the blog filename contains the name and Gender
of the author
13. FEATURES
For our task of Gender Identification, we take the help of
the following linguistic features:
● Character Based Features
● Word Based Features
● Syntactic Features
● Structural Features
● Function Words
● POS Start Probability
15. THE CLASSIFICATION TASK
For the task of classification, we used several classifying
algorithms and arrived at a model that uses ensemble of the
following classification algorithms:
● Random Forest Classifier
● Neural Networks Classifier
● Adaboost Tree Classifier
● Gradient Boosting Classifier
● Bagging Classifier
16. THE CLASSIFICATION TASK
For each of the classifier
● We fed it with partial features to actually see the variation
of accuracies with the features.
● We applied a 10 fold validation to measure the accuracies.
For measuring the accuracy of the ensemble we took the
majority class from the classified results of the classifiers.
17. RANDOM FOREST CLASSIFIER
● An meta estimator that fits a number
of decision tree classifiers on various
sub-samples of the dataset
● By using Random Forest Classifier we
were able to achieve an accuracy of
69.79%
18. NEURAL NETWORKS CLASSIFIER
● Consists of multiple layers of nodes
with each layer fully connected to the
next layer nodes and each node is a
neuron with non-linear perceptron.
● Uses a supervised learning called
backpropagation for training the
network.
● By using Neural Networks Classifier
we were able to achieve an accuracy
of 69.51%
19. ADABOOST TREE CLASSIFIER
● An meta estimator that begins by
fitting a classifier on the original
dataset and then fits the next round
classifiers on the same dataset
● By using Adaboost tree Classifier we
were able to achieve an accuracy of
69.57%
20. GRADIENT BOOSTING CLASSIFIER
● Builds model in a forward stage-wise
fashion.
● In each of the next stages weak
classifiers are introduced to
compensate the shortcomings of the
existing weak learners and these
shortcomings are identified by the
gradients.
● By using Gradient Boosting Classifier
we were able to achieve an accuracy
of 70.81%
21. BAGGING CLASSIFIER
● A meta estimator that fits the base
classifiers each on random subsets of
the datasets and then aggregate their
individual predictions.
● By using Gradient Boosting Classifier
we were able to achieve an accuracy
of 70.03%
22. THE ENSEMBLE
● An Ensemble takes the output of other
classifier and then applies a majority
voting to the outputs of the classifier
to determine the output.
● By using the Ensemble model on the
above discussed classifiers we were
able to achieve an accuracy of
71.10%
24. THE FINAL RESULTS
● By using the ensemble, we were
actually able to increase our efficiency
by nearly 1% in each case irrespective
of the performance of the individual
classifiers.
● The maximum obtainable accuracy
that was shown during the
experiments was 73.19% by the
Ensemble model.