The internet has evolved over the years, changing the way people communicate. The introduction of instant messaging, forums, social networking and blogs has made it possible for people of every age to become authors. However, this growth also invites various kinds of misuse: online communities are vulnerable to deceptive attacks and the spread of false information. Hence, author attribution becomes an important task for distinguishing an original writer from an imposter. This project tries to answer a simple question: "Given a text document, can we identify the author's gender, i.e. male or female?"
4. ABSTRACT
● Through the sands of time, textual content has remained a prominent feature of internet media, especially blogs.
● Thus, author profiling and attribution become an important task, and we try to capture one aspect of it, i.e. gender.
● The internet cannot take responsibility for all of its content; that responsibility lies with the author.
● But . . .
● a lot of content brings a lot of responsibility
5. The Question
Given a text blog, can we identify whether the writer is male or female?
8. THE APPROACH
● We take advantage of the linguistic features of the blog and create a feature file.
● This feature file is then used to train various classifiers, and a model is prepared for each classifier.
● An ensemble is applied on these models, and the input document is classified as written by a male or a female.
10. The Dataset
● Koppel's blog dataset
● contains about 19 thousand documents
● each document contains the text of about ~35 blog posts in XML format
[Dataset Link : http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm ]
11. PARSING
● Language used : Python
● Each blog entry is stored in XML format
<Blog>
<date>....... </date>
<post>
….
</post>
...
</Blog>
● Each blog filename contains the name and gender of the author
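The parsing step above can be sketched as follows. The filename pattern and the exact tag layout are assumptions based on the slide (Koppel's corpus files are often not strictly well-formed XML, so a forgiving regex is used here instead of an XML parser):

```python
import os
import re

def parse_blog_file(path):
    """Parse one corpus file into (gender, list of post texts).

    Assumes a filename like '1000331.female.37.indUnk.Leo.xml'
    and <date>/<post> entries inside a <Blog> element.
    """
    gender = os.path.basename(path).split(".")[1]   # 'male' or 'female'
    with open(path, encoding="utf-8", errors="ignore") as f:
        raw = f.read()
    # re.S lets '.' match newlines, since posts span multiple lines
    posts = [p.strip() for p in re.findall(r"<post>(.*?)</post>", raw, re.S)]
    return gender, posts
```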
13. FEATURES
For our task of Gender Identification, we use the following linguistic features:
● Character-Based Features
● Word-Based Features
● Syntactic Features
● Structural Features
● Function Words
● POS Start Probability
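A toy version of a few of these feature families is sketched below. The feature names and the function-word list are illustrative stand-ins, not the project's actual feature file:

```python
import re
from collections import Counter

# illustrative subset; real function-word lists run to hundreds of entries
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def extract_features(text):
    """Compute simplified character-based, word-based, structural,
    punctuation, and function-word features for one document."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    n_chars = len(text)
    feats = {
        "char_count": n_chars,                                        # character-based
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),    # word-based
        "sentence_count": max(len(re.split(r"[.!?]+", text.strip())) - 1, 1),  # structural
        "exclaim_ratio": text.count("!") / max(n_chars, 1),           # punctuation
    }
    for fw in FUNCTION_WORDS:                                         # function words
        feats["fw_" + fw] = counts[fw] / max(len(words), 1)
    return feats
```

In the real pipeline each blog would yield one such feature vector, written to the feature file row by row.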
15. THE CLASSIFICATION TASK
For the task of classification, we used several classification algorithms and arrived at a model that uses an ensemble of the following:
● Random Forest Classifier
● Neural Networks Classifier
● AdaBoost Tree Classifier
● Gradient Boosting Classifier
● Bagging Classifier
16. THE CLASSIFICATION TASK
For each of the classifiers:
● We fed it partial feature sets to observe how accuracy varies with the features used.
● We applied 10-fold cross-validation to measure accuracy.
To measure the accuracy of the ensemble, we took the majority class from the classifiers' outputs.
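The 10-fold validation protocol can be sketched in plain Python. This is a stand-in for the scikit-learn pipeline; `train_fn` abstracts over whichever classifier is being evaluated:

```python
import random

def k_fold_accuracy(X, y, train_fn, k=10, seed=0):
    """Plain k-fold cross-validation.

    train_fn(X_train, y_train) must return a predict(x) callable.
    Returns the mean accuracy over the k held-out folds.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]       # k roughly equal folds
    accs = []
    for fold in folds:
        held_out = set(fold)
        X_tr = [X[i] for i in idx if i not in held_out]
        y_tr = [y[i] for i in idx if i not in held_out]
        predict = train_fn(X_tr, y_tr)
        correct = sum(predict(X[i]) == y[i] for i in fold)
        accs.append(correct / len(fold))
    return sum(accs) / k
```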
17. RANDOM FOREST CLASSIFIER
● A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset.
● Using the Random Forest Classifier we were able to achieve an accuracy of 69.79%.
18. NEURAL NETWORKS CLASSIFIER
● Consists of multiple layers of nodes, with each layer fully connected to the next; each node is a neuron with a non-linear activation (a multilayer perceptron).
● Uses a supervised learning algorithm called backpropagation for training the network.
● Using the Neural Networks Classifier we were able to achieve an accuracy of 69.51%.
19. ADABOOST TREE CLASSIFIER
● A meta-estimator that begins by fitting a classifier on the original dataset and then fits further rounds of classifiers on the same dataset, re-weighted so that each round focuses on the previously misclassified instances.
● Using the AdaBoost Tree Classifier we were able to achieve an accuracy of 69.57%.
20. GRADIENT BOOSTING CLASSIFIER
● Builds the model in a forward stage-wise fashion.
● At each stage, weak classifiers are introduced to compensate for the shortcomings of the existing weak learners; these shortcomings are identified by the gradients of the loss function.
● Using the Gradient Boosting Classifier we were able to achieve an accuracy of 70.81%.
21. BAGGING CLASSIFIER
● A meta-estimator that fits base classifiers, each on a random subset of the dataset, and then aggregates their individual predictions.
● Using the Bagging Classifier we were able to achieve an accuracy of 70.03%.
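The bootstrap-and-aggregate idea can be illustrated with a toy sketch. The base classifier here is a 1-nearest-neighbour rule on scalar features, an illustrative stand-in for the decision trees used in practice:

```python
import random
from collections import Counter

def bagging_predict(X, y, x_new, n_estimators=15, seed=0):
    """Toy bagging on scalar features X with labels y.

    Each base classifier is 1-NN trained on a bootstrap sample;
    the final label is the majority vote over the estimators.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        sample = [rng.randrange(len(X)) for _ in X]          # bootstrap indices
        nearest = min(sample, key=lambda i: abs(X[i] - x_new))  # 1-NN on the sample
        votes.append(y[nearest])
    return Counter(votes).most_common(1)[0][0]               # aggregate by vote
```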
22. THE ENSEMBLE
● An ensemble takes the outputs of the other classifiers and applies majority voting to those outputs to determine the final label.
● Using the ensemble model on the classifiers discussed above we were able to achieve an accuracy of 71.10%.
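The majority vote over the five classifiers reduces to a few lines. Tie-breaking by first-seen label is an assumption of this sketch; with five voters and two classes, no tie can occur:

```python
from collections import Counter

def ensemble_predict(predictions):
    """Majority vote over the per-classifier labels for one document,
    e.g. ['male', 'female', 'male', 'male', 'female'] -> 'male'."""
    return Counter(predictions).most_common(1)[0][0]
```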
24. THE FINAL RESULTS
● By using the ensemble, we were able to increase our accuracy by nearly 1% in each case, irrespective of the performance of the individual classifiers.
● The maximum accuracy observed during the experiments was 73.19%, achieved by the ensemble model.