Classification of Hindi Literature according to Author Writing Style
Dhruv Anand (11251), Srijan Shetty (11727)
Motivation
➔ Document Fraud Detection
➔ Classifying works from unknown authors
➔ From a Literary perspective
◆ Repeating trends of authors
◆ Adopting styles of popular authors
Previous Work
➔ Extensive work done on Author Attribution for
English (using domain-specific datasets like
blogs, emails, forum posts, short stories and
novels)
➔ No work has been done on Hindi datasets
➔ Various lexical and syntactic features have
been tried by researchers in this field
Challenges
➔ Non-uniform data for Hindi
➔ Variance of writing style markers in Hindi
Literature
➔ Hindi is morphologically rich: many derived word
forms must be aggregated, and no ready-made
tool for lemmatization is available.
Problem Statement
➔ Apply known methods of Author Attribution
to a Hindi dataset
➔ Analyse the difference in effectiveness of various
methods between English and Hindi
➔ Explore new types of lexical and syntactic
features to give better results for Hindi
Literature
Methodology
Proposed Features
➔ Word n-grams
◆ Stemmed/non-stemmed unigrams
◆ Collocations (bigrams)
➔ Character n-grams
➔ Sentence length distribution
➔ Word length distribution
➔ Feature word frequency distribution
*image from
[Sta09]
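The proposed features can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in the project; the function names, the punctuation set, and the choice of the danda (।) as the sentence delimiter are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: split on whitespace, the danda, and basic punctuation.
    return [t for t in re.split(r"[\s।,;:!?()]+", text) if t]

def word_ngrams(tokens, n):
    # Counts of consecutive word n-grams (n=1 unigrams, n=2 bigrams).
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def char_ngrams(text, n):
    # Counts of consecutive character n-grams over the raw text.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def sentence_lengths(text):
    # Distribution of sentence lengths in words; sentences are assumed
    # to end at a danda, question mark, or exclamation mark.
    sentences = [s for s in re.split(r"[।?!]", text) if s.strip()]
    return Counter(len(tokenize(s)) for s in sentences)

def word_lengths(tokens):
    # Distribution of word lengths in Unicode code points. For Devanagari
    # this counts vowel signs (matras) as separate code points.
    return Counter(len(t) for t in tokens)
```

Each function returns a `Counter`, so any of the distributions can be turned into a fixed-length feature vector by reading the counts off in a fixed key order.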
Classification
➔ Supervised
◆ SVMs
◆ Bayesian Multinomial Regression (BMR)
➔ Unsupervised
◆ K-means clustering
Framework
➔ Stage 1: Feature Extraction
◆ Text Snippets + Feature Specification → Feature Vectors
➔ Stage 2: Classification
◆ Feature Vectors → Label Assignment
➔ Stage 3: Evaluation
◆ Label Assignment → Results
A bit of theory
Bag of Words
http://www.python-course.eu/images/document_representation.png
(http://www.mathworks.com/matlabcentral/fileexchange/screenshots/2240/original.jpg)
K Means
http://www.thebookmyproject.com/wp-content/uploads/Intrusion-Detection-
Technique-by-using-K-means-Fuzzy-Neural-Network-and-SVM-classifiers.jpg
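As a rough illustration of K-means, the sketch below implements the standard Lloyd iteration (assign each vector to its nearest centroid, then move each centroid to the mean of its members) in plain Python. It is a teaching sketch under assumed defaults (random initial centroids, squared Euclidean distance, a fixed seed), not the implementation used for the results in this deck.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumed initialisation: k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Calling `kmeans(feature_vectors, k=5)` returns one cluster ID per snippet. The IDs are arbitrary, so they still have to be matched to authors before precision and recall can be computed.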
SVM
http://upload.wikimedia.org/math/2/e/e/2eeac600b65d77080381284f530f37d4.png
BMR
Where do we stand
Dataset Compilation
➔ No standard dataset for classical/contemporary
Hindi authors (novels and stories)
➔ Scraped HindiSamay.com manually to build a
database of Classical Hindi literature.
◆ 5 authors
◆ 2-4 lakh words per author
➔ Each author’s work has been divided into
multiple snippets of 500 words.
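Splitting an author's text into 500-word snippets might look like the sketch below. The function name is hypothetical, and dropping a short final snippet is an assumption; the slides do not say how the remainder was handled.

```python
def split_into_snippets(tokens, size=500, keep_remainder=False):
    # Cut the token list into consecutive chunks of `size` words.
    snippets = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    # Assumption: a final chunk shorter than `size` is dropped by default.
    if snippets and not keep_remainder and len(snippets[-1]) < size:
        snippets.pop()
    return snippets
```

With 2-4 lakh words per author, this yields a few hundred snippets per author.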
Unigrams
➔ Belief: Authors repeat the same set of words
➔ Stemming: BOW using all tokens and BOW
using the 4500 most frequent words (frequency
> 20 in the entire corpus)
➔ Classification: K-means on 3 classes (RNT,
Premchand, V.N. Rai) and on 5 classes.
➔ Results for 3 classes:
◆ Average Precision: 50% (vs. baseline of 33%)
◆ Average Recall: 48% (vs. baseline of 33%)
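The restricted bag-of-words representation can be sketched as follows. The function names and parameters are illustrative assumptions; in particular, applying both the 4500-word cap and the frequency threshold together is one reading of the slide.

```python
from collections import Counter

def build_vocabulary(snippets, max_words=4500, min_freq=20):
    # Corpus-wide word counts across all snippets.
    totals = Counter(tok for snippet in snippets for tok in snippet)
    # Keep words whose corpus frequency is strictly greater than min_freq,
    # most frequent first, capped at max_words entries.
    frequent = [w for w, c in totals.most_common() if c > min_freq]
    return frequent[:max_words]

def bag_of_words(snippet, vocabulary):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(snippet)
    return [counts[w] for w in vocabulary]
```

Each snippet then becomes a vector of at most 4500 counts, which is the input to the clustering step.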
Results with 5 authors
(Columns 0-4: number of the author's snippets assigned to each K-means cluster)
Author       0    1    2    3    4   Snippets   Precision   Recall
RNT        111   14   20    0    6      151       22.65%    73.5%
Prem       108   21   58    0  211      398       71.77%    53.01%
Dharamvir   11   24   14  150    2      201       100%      74.6%
Sarat      142  332    3    0   65      542       82.19%    61.25%
VN         118   13  277    0   10      418       74.46%    66.26%
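The precision and recall figures can be recomputed from the cluster counts in the table. The sketch below assumes, as an inference from the numbers rather than something the slides state, that each author is matched to the cluster holding most of that author's snippets; precision is then that count divided by the cluster size, and recall is that count divided by the author's snippet total.

```python
# Rows: authors; columns: K-means clusters 0-4 (counts from the table).
confusion = {
    "RNT":       [111, 14, 20, 0, 6],
    "Prem":      [108, 21, 58, 0, 211],
    "Dharamvir": [11, 24, 14, 150, 2],
    "Sarat":     [142, 332, 3, 0, 65],
    "VN":        [118, 13, 277, 0, 10],
}

def precision_recall(confusion):
    n_clusters = len(next(iter(confusion.values())))
    # Total number of snippets that landed in each cluster.
    cluster_sizes = [
        sum(row[c] for row in confusion.values()) for c in range(n_clusters)
    ]
    scores = {}
    for author, row in confusion.items():
        # Assumed matching rule: the cluster with most of this author's snippets.
        cluster = max(range(n_clusters), key=lambda c: row[c])
        hits = row[cluster]
        precision = 100.0 * hits / cluster_sizes[cluster]
        recall = 100.0 * hits / sum(row)
        scores[author] = (cluster, precision, recall)
    return scores

for author, (cluster, p, r) in precision_recall(confusion).items():
    print(f"{author}: cluster {cluster}, precision {p:.2f}%, recall {r:.2f}%")
```

Under this rule the computed values agree with the table to within a few hundredths of a percentage point. For example, RNT maps to cluster 0, which holds 490 snippets in total, giving a precision of 111/490 and a recall of 111/151.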
Insights
➔ The corpus has mostly stories for Rabindranath
Tagore (RNT); both recall and precision for him are
low, indicating that the frequent words an author
uses change across works.
➔ The corpus contained only novels for Premchand,
and both recall and precision for him were
high (> 70%).
➔ The corpus contained essays by V.N. Rai,
indicating a high proportion of content words.
Future Work
In the coming weeks
➔ Use collocations (bigrams) as a feature.
➔ Analyzing sentence structure:
◆ Sentence lengths
◆ Number of subjects, verbs, objects in a sentence (instead
of POS tagging we will look up common words in
HindiWordNet)
➔ Reducing dimensionality using PCA.
➔ Training on multiple features together (using multivariate
discriminant analysis)
➔ Improving results by tuning snippet length and parameters
used in classification.
In the future
➔ Exploring the possibility of using a morphological
tagger to get more accurate style measures for authors.
➔ Extending the method to Hindi tweets, forum
comments and messages to compare accuracy.
References
Literature
1. [KSA09] Moshe Koppel, Jonathan Schler, and Shlomo
Argamon. Computational methods in authorship
attribution. J. Am. Soc. Inf. Sci. Technol., 60(1):9-26, January
2009.
2. [KSA11] Moshe Koppel, Jonathan Schler, and Shlomo
Argamon. Authorship attribution in the wild. Lang. Resour.
Eval., 45(1):83-94, March 2011.
3. [Sta09] Efstathios Stamatatos. A survey of modern
authorship attribution methods. J. Am. Soc. Inf. Sci. Technol.,
60(3):538-556, March 2009.
Tools Used
➔ ZSH
➔ Python Modules
◆ indicngram
◆ nltk, scipy, scikit-learn
➔ Snippets of code have been taken from
◆ http://www.csc.villanova.edu/~matuszek/spring2012/snippets.html
*www.python.org
THANK YOU!
Questions?
