Guided by: Dr. Nakul Sharma
1. Devkinandan Jagtap
2. Shweta Ambekar
3. Harshit Singh
Our methodology introduces an intelligent system for analyzing
writing styles using stylometric analysis. It evaluates four clustering
algorithms (k-means, k-means++, hierarchical, and DBSCAN) and uses
silhouette scores to measure how well the resulting clusters differentiate
texts by their linguistic and structural features.
• Lexical Features
Measuring word usage patterns, average word length, and syllables per word.
• Vocabulary Richness Features
Evaluating the complexity and diversity of a text's vocabulary.
• Readability Scores
Assessing the level of difficulty or simplicity of a text.
Stylometric analysis is the study of linguistic and structural features in text
to identify patterns unique to individual authors or groups of authors.
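The three feature groups above can be illustrated with a minimal sketch. The function names, the vowel-group syllable heuristic, the use of type-token ratio for vocabulary richness, and the Flesch Reading Ease formula for readability are assumptions chosen for illustration; the slides do not state which exact measures were used.

```python
import re

def count_syllables(word):
    # Rough heuristic (assumption): one syllable per group of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def stylometric_features(text):
    """Return a few lexical, vocabulary-richness, and readability features."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    n_sentences = max(1, len(sentences))

    # Lexical features
    avg_word_length = sum(len(w) for w in words) / n_words
    syllables_per_word = sum(count_syllables(w) for w in words) / n_words

    # Vocabulary richness: type-token ratio (distinct words / total words)
    type_token_ratio = len({w.lower() for w in words}) / n_words

    # Readability: Flesch Reading Ease (higher means easier to read)
    words_per_sentence = n_words / n_sentences
    flesch = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

    return {
        "avg_word_length": avg_word_length,
        "syllables_per_word": syllables_per_word,
        "type_token_ratio": type_token_ratio,
        "flesch_reading_ease": flesch,
    }

features = stylometric_features("The cat sat. The cat ran.")
print(features)
```

Each text then becomes one numeric feature vector, which is the input the clustering algorithms operate on.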
KMeans
A popular clustering method that separates data into K clusters based on
similarities between data points.
KMeans++
An improvement over KMeans in how the initial centroids are selected.
Clustering is a machine learning technique that groups similar data points together
based on certain features or characteristics.
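The difference between KMeans and KMeans++ lies only in initialization, which a small pure-Python sketch can show. This is an illustrative implementation, not the code used in the project; the function names, the `init` parameter, and the toy 2-D points are assumptions.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    # KMeans++ seeding: each new centroid is sampled with probability
    # proportional to its squared distance from the nearest chosen centroid.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:  # all points coincide with existing centroids
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

def kmeans(points, k, init="k-means++", iters=100, seed=0):
    rng = random.Random(seed)
    if init == "k-means++":
        centroids = kmeans_pp_init(points, k, rng)
    else:
        centroids = rng.sample(points, k)  # plain KMeans: uniform random start
    labels = None
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(points, k=2, init="random")[0])
print(kmeans(points, k=2, init="k-means++")[0])
```

On well-separated data such as this toy set, both initializations typically reach the same partition; KMeans++ mainly reduces the chance of a poor starting configuration.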
Hierarchical Clustering
Creates a tree-like hierarchy of clusters (a dendrogram) by successively
merging or splitting groups.
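A bottom-up (agglomerative) version of hierarchical clustering can be sketched as follows. Single linkage and the function name are assumptions made for brevity; the slides do not say which linkage criterion was used.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Uses single linkage: the distance between two clusters is the distance
    between their closest pair of points. Returns the final clusters (as
    lists of point indices) and the merge history, which encodes the tree.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
```

Reading `merges` from first to last gives the levels of the tree, from the closest pairs up to the final groups.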
DBSCAN
A density-based algorithm that identifies clusters from the density of
data points within a specified radius.
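The density idea can be made concrete with a compact sketch. The parameter names `eps` (the radius) and `min_pts` (the minimum number of points within that radius for a point to count as dense) follow common usage and are assumptions here, as are the toy points.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it is in no dense region."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within radius eps of point i (including i itself).
        return [
            j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: keep growing
                queue.extend(j_nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike KMeans, the number of clusters is not given in advance, and isolated points such as `(50, 50)` are reported as noise.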
Silhouette Score
The silhouette score quantifies clustering quality by comparing how close
each data point is to its own cluster (cohesion) with how close it is to the
nearest other cluster (separation). It ranges from -1 to 1, where higher
values indicate more cohesive, better-separated clusters.
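For a single point, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below averages this over all points; the function name and the toy data are assumptions for illustration.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette value over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # natural grouping: close to 1
print(silhouette_score(points, [0, 1, 0, 1]))  # mismatched grouping: negative
```

Comparing the score across different values of k, or across algorithms, is how a clustering can be judged without ground-truth author labels.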
1. Name Two Different Styles (k = 2)
Stories by one author and poems by another author separate into
two distinct clusters.
2. Different Author Text (k = 2)
Texts by different authors also separate into clusters, but with a lower
silhouette value.
3. Increased Clusters (k = 5)
The clusters lie very close to each other, and the silhouette score
decreases.
Data Collection
Small dataset of 10 samples.
Data Preprocessing
Extracting text from PDF and obtaining font information.
Feature Extraction
Stylometric analysis on the extracted text data.
Enhancing Plagiarism Detection
Extending the work to enhance existing plagiarism detection algorithms.
Academic Settings
Useful in academic settings for detecting plagiarism between assignments
by detecting style similarities.
Optimizing Methodology
Optimizing the current methodology so that accuracy holds up as the value
of k increases.
Optimized System
Works properly for determining different writing styles when k = 2.
Clustering Algorithms
KMeans and KMeans++ algorithms provide nearly the same output.
Future Research
Opportunities to add more parameters and optimize the methodology for
good accuracy.
https://ieeexplore.ieee.org/document/10482055

An Approach to Detecting Writing Styles Based on Clustering Techniques
