Apply known methods of Author Attribution to a Hindi dataset ➔ Analyse difference in effectiveness of various methods between English and Hindi ➔ Exploring new types of lexical and syntactic features to give better results for Hindi Literature
This document discusses classifying Hindi literature according to author writing style. It aims to apply author attribution methods used for English texts to a Hindi dataset. The document outlines previous work on author attribution, challenges for Hindi like non-uniform data and derivative words. It proposes using word and character n-grams, sentence length distributions and word frequencies as features for classification models like SVMs and Bayesian Multinomial Regression. An initial dataset was compiled and initial results using word unigrams achieved 50% average precision and 48% average recall for 3 authors, and over 70% precision and 50% recall for 2 authors. Future work aims to explore additional features and classification methods.
Gender Classification of Blog Authors: With Feature Engineering and Deep Lear...Saurav Jha
In this paper, we present two approaches to automatically classify the gender of blog authors: the first is a manual feature extraction based system incorporating two novel feature classes: variable length character sequence patterns and thirteen new word classes, along with an added class of surface features while the second is a first-ever application of a memory variant of Recurrent Neural Networks, i.e. Bidirectional Long Short Term Memory Networks (BLSTMs) on this task. We use two blog data sets to report our results: the first is a well-explored one used by the previous state-of-the-art models while the other is a 20 times larger corpus. For the first system, we use a voting of machine learning classifiers to obtain an improved accuracy with respect to the previous feature mining systems on the former data set. Using our second approach, we show that the accuracy obtained using such deep LSTMs is comparable to the current state-of-the-art deep learning system for the task of gender classification. Finally, we carry out a comparative study of performance of both the systems on the two data sets.
Gender Classification of Blog Authors: With Feature Engineering and Deep Lear...Saurav Jha
In this paper, we present two approaches to automatically classify the gender of blog authors: the first is a manual feature extraction based system incorporating two novel feature classes: variable length character sequence patterns and thirteen new word classes, along with an added class of surface features while the second is a first-ever application of a memory variant of Recurrent Neural Networks, i.e. Bidirectional Long Short Term Memory Networks (BLSTMs) on this task. We use two blog data sets to report our results: the first is a well-explored one used by the previous state-of-the-art models while the other is a 20 times larger corpus. For the first system, we use a voting of machine learning classifiers to obtain an improved accuracy with respect to the previous feature mining systems on the former data set. Using our second approach, we show that the accuracy obtained using such deep LSTMs is comparable to the current state-of-the-art deep learning system for the task of gender classification. Finally, we carry out a comparative study of performance of both the systems on the two data sets.
AN AUTHORSHIP IDENTIFICATION EMPIRICAL EVALUATION OF WRITING STYLE FEATURES I...CSIT8
In this paper, an investigation was done to identify writing style features that can be used for cross-topic
and cross-genre documents in the Authorship Identification task from 2003 to 2015. Different writing style
features were empirically evaluated that were previously used in single topic and single genre documents
for Authorship Identification to determine whether they can be used effectively for cross-topic and crossgenre Authorship Identification using an ablation process. The dataset used was taken from the 2015 PAN
CLEF Forum English collection consisting of 100 sets. Furthermore, it was investigated whether
combining some of these feature sets can help improve the authorship identification task. Three different
classifiers were used: Naïve Bayes, Support Vector Machine, and Random Forest. The results suggest that
a combination of a lexical, syntactical, structural, and content feature set can be used effectively for cross
topic and cross genre authorship identification, as it achieved an AUC result of 0.837.
AN AUTHORSHIP IDENTIFICATION EMPIRICAL EVALUATION OF WRITING STYLE FEATURES I...ijaia
In this paper, an investigation was done to identify writing style features that can be used for cross-topic
and cross-genre documents in the Authorship Identification task from 2003 to 2015. Different writing style
features were empirically evaluated that were previously used in single topic and single genre documents
for Authorship Identification to determine whether they can be used effectively for cross-topic and crossgenre Authorship Identification using an ablation process. The dataset used was taken from the 2015 PAN
CLEF Forum English collection consisting of 100 sets. Furthermore, it was investigated whether
combining some of these feature sets can help improve the authorship identification task. Three different
classifiers were used: Naïve Bayes, Support Vector Machine, and Random Forest. The results suggest that
a combination of a lexical, syntactical, structural, and content feature set can be used effectively for cross
topic and cross genre authorship identification, as it achieved an AUC result of 0.837.
A Survey of Various Methods for Text SummarizationIJERD Editor
Document summarization means retrieved short and important text from the source document. In this paper, we studied various techniques. Plenty of techniques have been developed on English summarization and other Indian languages but very less efforts have been taken for Hindi language. Here, we discusses various techniques in which so many features are included such as time and memory consumption, efficiency, accuracy, ambiguity, redundancy.
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
More Related Content
Similar to Apply known methods of Author Attribution to a Hindi dataset ➔ Analyse difference in effectiveness of various methods between English and Hindi ➔ Exploring new types of lexical and syntactic features to give better results for Hindi Literature
AN AUTHORSHIP IDENTIFICATION EMPIRICAL EVALUATION OF WRITING STYLE FEATURES I...CSIT8
In this paper, an investigation was done to identify writing style features that can be used for cross-topic
and cross-genre documents in the Authorship Identification task from 2003 to 2015. Different writing style
features were empirically evaluated that were previously used in single topic and single genre documents
for Authorship Identification to determine whether they can be used effectively for cross-topic and crossgenre Authorship Identification using an ablation process. The dataset used was taken from the 2015 PAN
CLEF Forum English collection consisting of 100 sets. Furthermore, it was investigated whether
combining some of these feature sets can help improve the authorship identification task. Three different
classifiers were used: Naïve Bayes, Support Vector Machine, and Random Forest. The results suggest that
a combination of a lexical, syntactical, structural, and content feature set can be used effectively for cross
topic and cross genre authorship identification, as it achieved an AUC result of 0.837.
AN AUTHORSHIP IDENTIFICATION EMPIRICAL EVALUATION OF WRITING STYLE FEATURES I...ijaia
In this paper, an investigation was done to identify writing style features that can be used for cross-topic
and cross-genre documents in the Authorship Identification task from 2003 to 2015. Different writing style
features were empirically evaluated that were previously used in single topic and single genre documents
for Authorship Identification to determine whether they can be used effectively for cross-topic and crossgenre Authorship Identification using an ablation process. The dataset used was taken from the 2015 PAN
CLEF Forum English collection consisting of 100 sets. Furthermore, it was investigated whether
combining some of these feature sets can help improve the authorship identification task. Three different
classifiers were used: Naïve Bayes, Support Vector Machine, and Random Forest. The results suggest that
a combination of a lexical, syntactical, structural, and content feature set can be used effectively for cross
topic and cross genre authorship identification, as it achieved an AUC result of 0.837.
A Survey of Various Methods for Text SummarizationIJERD Editor
Document summarization means retrieved short and important text from the source document. In this paper, we studied various techniques. Plenty of techniques have been developed on English summarization and other Indian languages but very less efforts have been taken for Hindi language. Here, we discusses various techniques in which so many features are included such as time and memory consumption, efficiency, accuracy, ambiguity, redundancy.
Use of paraphracing in Research Writing.pptxGopiDervaliya
Use of paraphracing in Research Writing
Similar to Apply known methods of Author Attribution to a Hindi dataset ➔ Analyse difference in effectiveness of various methods between English and Hindi ➔ Exploring new types of lexical and syntactic features to give better results for Hindi Literature (20)
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
Safalta Digital marketing institute in Noida, provide complete applications that encompass a huge range of virtual advertising and marketing additives, which includes search engine optimization, virtual communication advertising, pay-per-click on marketing, content material advertising, internet analytics, and greater. These university courses are designed for students who possess a comprehensive understanding of virtual marketing strategies and attributes.Safalta Digital Marketing Institute in Noida is a first choice for young individuals or students who are looking to start their careers in the field of digital advertising. The institute gives specialized courses designed and certification.
for beginners, providing thorough training in areas such as SEO, digital communication marketing, and PPC training in Noida. After finishing the program, students receive the certifications recognised by top different universitie, setting a strong foundation for a successful career in digital marketing.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Apply known methods of Author Attribution to a Hindi dataset ➔ Analyse difference in effectiveness of various methods between English and Hindi ➔ Exploring new types of lexical and syntactic features to give better results for Hindi Literature
2. Motivation
➔ Document Fraud Detection
➔ Classifying works from unknown authors
➔ From a Literary perspective
◆ Repeating trends of authors
◆ Adopting styles of popular authors
3. Previous Work
➔ Extensive work done on Author Attribution for
English (using domain-specific datasets like
blogs, emails, forum posts, short stories and
novels)
➔ No work has been done on Hindi datasets
➔ Various lexical and syntactic features have
been tried by researchers in this field
4. Challenges
➔ Non-uniform data for Hindi
➔ Variance of writing style markers in Hindi
Literature
➔ Multiple derivative words that must be
aggregated without any pre-programmed tool
for lemmatization. (The language is
morphologically rich.)
5. Problem Statement
➔ Apply known methods of Author Attribution
to a Hindi dataset
➔ Analyse difference in effectiveness of various
methods between English and Hindi
➔ Exploring new types of lexical and syntactic
features to give better results for Hindi
Literature
7. Proposed Features
➔ Word n-grams
◆ Stemmed/non-stemmed unigrams
◆ Collocations (bigrams)
➔ Character n-grams
➔ Sentence length distribution
➔ Word length distribution
➔ Feature word frequency distribution
17. Dataset Compilation
➔ No standard dataset for classical/contemporary
hindi authors (novels and stories)
➔ Scraped HindiSamay.com manually to build a
database of Classical Hindi literature.
◆ 5 authors
◆ 2-4 lakh words per author
➔ Each author’s work has been divided into
multiple snippets of 500 words.
18. Unigrams
➔ Belief: Authors repeat the same set of words
➔ Stemming: BOW using all tokens and BOW
using 4500 most frequent words (>20
frequency in the entire corpus)
➔ Classification: K-means on 3 classes (RNT,
Premchand, V.N.Rai) and on 5 classes.
➔ Results for 3 classes:
◆ Average Precision: 50% (v/s baseline of 33%)
◆ Average Recall: 48% (v/s baseline of 33%)
20. Insights
➔ Corpus has mostly stories for Rabindranath
Tagore, both recall and precision for him are
low indicating that across multiple works
frequent words used by author change.
➔ Corpus contained only novels for Premchand
and so both recall and precision for him were
high > 70%
➔ The corpus contained essays by V.N.Rai,
indicating high amount of content words.
22. In the coming weeks
➔ Use collocations (bigrams) to as a feature.
➔ Analyzing sentence structure:
◆ Sentence lengths
◆ Number of subjects, verbs, objects in a sentence (instead
of POS tagging we will lookup common words from
HindiWordNet)
➔ Reducing dimensionality using PCA.
➔ Training on multiple features together (using multivariate
discriminant analysis)
➔ Improving results by tuning snippet length and parameters
used in classification.
23. In the future
➔ Exploring the possibility of using a morphological
tagger to get more accurate style measures for authors.
➔ Extending the method to Hindi tweets, forum
comments and messages to compare accuracy.
25. Literature
1. [KSA09] Moshe Koppel, Jonathan Schler, and Shlomo
Argamon. Computational methods in authorship
attribution. J. Am. Soc. Inf. Sci. Technol., 60(1):9-26, January
2009.
2. [KSA11] Moshe Koppel, Jonathan Schler, and Shlomo
Argamon. Authorship attribution in the wild. Lang.Resour.
Eval., 45(1):83-94, March 2011.
3. [Sta09] Efstathios Stamatatos. A survey of modern
authorship attribution methods. J. Am. Soc. Inf. Sci. Technol.,
60(3):538-556, March 2009.
26. Tools Used
➔ ZSH
➔ Python Modules
◆ indicngram
◆ nltk, scipy, scikit-learn
➔ Snippets of code have been taken from
◆ http://www.csc.villanova.
edu/~matuszek/spring2012/snippets.html
*www.python.org