An Approach to Detecting Writing Styles Based on Clustering Techniques

Guided by: Dr. Nakul Sharma
1. Devkinandan Jagtap
2. Shweta Ambekar
3. Harshit Singh

Our methodology introduces an intelligent system that analyzes
writing styles through stylometric analysis and evaluates four clustering
algorithms: k-means, k-means++, hierarchical clustering, and DBSCAN. We use
silhouette scores to assess how well the resulting clusters differentiate
texts based on linguistic and structural features.

• Lexical Features
Measuring word usage patterns, average word length, and syllables per word.
• Vocabulary Richness Features
Evaluating the complexity and diversity of a text's vocabulary.
• Readability Scores
Assessing the level of difficulty or simplicity of a text.
Stylometric analysis is the study of linguistic and structural features in text
to identify patterns unique to individual authors or groups of authors.
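
The three feature groups listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the regular expressions, the vowel-group syllable heuristic, and the choice of type-token ratio and Flesch Reading Ease as the richness and readability measures are all assumptions made for illustration.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of consecutive vowels (a heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def stylometric_features(text):
    """Return a few lexical, vocabulary-richness, and readability features."""
    # Words are runs of letters, optionally with one inner apostrophe (e.g. "don't").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Sentences are split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    n_words = len(words)
    syllables_per_word = sum(count_syllables(w) for w in words) / n_words
    words_per_sentence = n_words / len(sentences)

    return {
        # Lexical features
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "syllables_per_word": syllables_per_word,
        # Vocabulary richness: distinct words divided by total words
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # Readability: Flesch Reading Ease (higher means easier to read)
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
    }


features = stylometric_features("The cat sat. The cat ran.")
```

Each text sample would yield one such feature vector, and those vectors are what a clustering algorithm would then group.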

KMeans
A popular clustering method that partitions data into K clusters based on
the similarity between data points.
KMeans++
An improvement over KMeans in how the initial centroids are selected.
Clustering is a machine learning technique that groups similar data points together
based on certain features or characteristics.
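
To make the difference between KMeans and KMeans++ concrete, here is a minimal pure-Python sketch, assuming Euclidean distance and points given as tuples. The function names, the `init` and `seed` parameters, and the sample points are hypothetical and are not taken from the study. The only difference between the two variants is how the starting centroids are chosen; the iterative refinement is identical.

```python
import math
import random


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


def kmeans(points, k, init="k-means++", max_iter=100, seed=0):
    """Lloyd's algorithm; `init` selects k-means++ or uniform random seeding."""
    rng = random.Random(seed)
    if init == "k-means++":
        centroids = kmeans_pp_init(points, k, rng)
    else:
        centroids = rng.sample(points, k)

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Invented 2-D points forming two well-separated groups.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(sample, k=2)
```

On well-separated data like this, both seeding strategies usually reach the same partition, which is consistent with the deck's later observation that the two algorithms give nearly the same output.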

Hierarchical Clustering
Builds a tree-like hierarchy of clusters by successively merging or
splitting groups of data points.
DBSCAN
A density-based algorithm that identifies clusters from the density of
data points within a specified radius.
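
The density-based idea behind DBSCAN can be sketched in a few lines of pure Python. This is an illustrative sketch rather than the implementation used in the study: it assumes Euclidean distance, and the names `eps` (the radius), `min_pts` (the minimum number of neighbours, counting the point itself, for a dense region), and the sample points are all hypothetical.

```python
import math

NOISE = -1  # label for points that belong to no dense region


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Two dense groups plus one isolated point (invented data).
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))  # prints [0, 0, 0, 1, 1, 1, -1]
```

Unlike KMeans, this approach does not take the number of clusters as input, and it can leave isolated points unassigned.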

Silhouette Score
The silhouette score quantifies how well each data point fits its assigned
cluster by comparing its proximity to its own cluster (cohesion) with its
proximity to the nearest other cluster (separation). It ranges from -1 to 1,
where higher values indicate more cohesive, better-separated clusters.
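
As a worked illustration of the definition above: for each point, let a be its mean distance to the other points in its own cluster and b its mean distance to the points of the nearest other cluster. The point's silhouette is (b - a) / max(a, b), and the overall score is the mean over all points. The sketch below assumes Euclidean distance; the function name and the sample data are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Invented 1-D data: the first labeling matches the natural grouping.
sample = [(0,), (1,), (10,), (11,)]
good = silhouette_score(sample, [0, 0, 1, 1])
bad = silhouette_score(sample, [0, 1, 0, 1])
```

Here `good` is close to 0.9 while `bad` is negative, which matches the reading that higher values mean better-defined clusters.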

1. Two Different Styles (k = 2)
Stories by one author and poems by another author separate into
two distinct clusters.
2. Different Author Text (k = 2)
Texts by different authors also separate into two clusters, but with a
lower silhouette score.
3. Increased Clusters (k = 5)
The clusters lie very close to each other, and the silhouette score
decreases.

Data Collection
Small dataset of 10 samples.
Data Preprocessing
Extracting text from PDF and obtaining font information.
Feature Extraction
Stylometric analysis on the extracted text data.

Enhancing Plagiarism Detection
Extending the work to enhance existing plagiarism detection algorithms.
Academic Settings
Useful in academic settings for detecting plagiarism between assignments
by identifying style similarities.
Optimizing Methodology
Optimizing the current methodology to maintain good accuracy as the value
of k increases.

Optimized System
Works properly for determining different writing styles when k = 2.
Clustering Algorithms
KMeans and KMeans++ algorithms provide nearly the same output.
Future Research
Opportunities to add more parameters and optimize the methodology for
good accuracy.

https://ieeexplore.ieee.org/document/10482055
