An Approach to Detecting Writing Styles Based on Clustering Techniques

ambekarshweta25, 12 slides, May 31, 2024

About This Presentation

An Approach to Detecting Writing Styles Based on Clustering Techniques
Authors:

-Devkinandan Jagtap
-Shweta Ambekar
-Harshit Singh
-Nakul Sharma (Assistant Professor)
Institution:

VIIT Pune, India
Abstract:
This paper proposes a system to differentiate between human-generated and AI-generated text...


Slide Content

An Approach to Detecting Writing Styles Based on Clustering Techniques

Group Members
Devkinandan Jagtap, Shweta Ambekar, Harshit Singh
Guided by: Dr. Nakul Sharma
Vishwakarma Institute of Information Technology, Pune

Introduction
Our methodology introduces an intelligent system for analyzing writing styles using stylometric analysis. It evaluates clustering algorithms such as k-means, k-means++, hierarchical clustering, and DBSCAN, and uses silhouette scores to check how effectively texts are differentiated based on linguistic and structural features.

Stylometric Analysis
Stylometric analysis is the study of linguistic and structural features in text to identify patterns unique to individual authors or groups of authors.
Lexical Features: Measuring word usage patterns, average word length, and syllables per word.
Vocabulary Richness Features: Evaluating the complexity and diversity of a text's vocabulary.
Readability Scores: Assessing the level of difficulty or simplicity of a text.
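
The three feature families above can be illustrated with a short sketch. This is not the authors' implementation: the function names, the vowel-group syllable heuristic, the use of type-token ratio as the vocabulary richness measure, and the choice of Flesch Reading Ease as the readability score are all assumptions made for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def stylometric_features(text: str) -> dict:
    """Compute a few illustrative stylometric features for one text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    n_words = len(words)
    words_per_sentence = n_words / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / n_words

    return {
        # Lexical features
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "syllables_per_word": syllables_per_word,
        # Vocabulary richness: distinct words divided by total words
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # Readability: Flesch Reading Ease (higher means easier to read)
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
    }
```

Each text then becomes one numeric feature vector, which is the form a clustering algorithm expects as input.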

Clustering Algorithms
Clustering is a machine learning technique that groups similar data points together based on certain features or characteristics.
KMeans: A popular clustering method that partitions data into K clusters based on the similarities between data points.
KMeans++: An improvement over KMeans in how the initial centroids are selected.
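
The difference between KMeans and KMeans++ lies only in the seeding step, which a minimal pure-Python sketch can make concrete. The function names (`kmeans`, `kmeans_pp_init`, `sq_dist`) and parameters are hypothetical, and the sketch assumes points are given as tuples of floats; a real pipeline would more likely rely on a library implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new centroid is sampled with probability
    proportional to its squared distance from the nearest centroid so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # all points coincide with chosen centroids
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, init="k-means++", max_iter=100, seed=0):
    """Lloyd's algorithm with either k-means++ or plain random seeding."""
    rng = random.Random(seed)
    if init == "k-means++":
        centroids = kmeans_pp_init(points, k, rng)
    else:
        centroids = rng.sample(points, k)

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids
```

On well-separated data both seeding strategies tend to reach the same partition; k-means++ mainly reduces the chance of a poor start on harder data.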

Clustering Algorithms (Continued)
Hierarchical Clustering: Builds a tree-like hierarchy of clusters.
DBSCAN: A density-based algorithm that identifies clusters from the density of data points within a specified radius.
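
As a rough illustration of the density-based idea, here is a minimal DBSCAN sketch in pure Python (hierarchical clustering is not shown). The function name and the parameter names `eps` (the radius) and `min_samples` (the density threshold) are assumptions chosen for this example, and the brute-force neighbor search is only suitable for small datasets.

```python
import math


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels
```

Unlike k-means, this approach does not take the number of clusters as input and can leave isolated points unassigned as noise.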

Performance Metrics
Silhouette Score: Quantifies how well each data point fits its assigned cluster by comparing its proximity to the points in its own cluster (cohesion) with its proximity to the points in the nearest other cluster (separation). It ranges from -1 to 1, where higher values indicate tighter, better-separated clusters.
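
The silhouette score can be computed directly from its standard definition: for each point, s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below assumes Euclidean distance and uses a hypothetical function name; it is meant to show the calculation, not to reproduce the authors' code.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of p's own cluster
        # (p's zero distance to itself is in the sum, so divide by len - 1)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(mean_dist(p, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Comparing this value across different choices of k is one way to judge which clustering separates the texts most cleanly.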

Experiment Results
Two Different Styles (k = 2): Stories by one author and a poem by another author are separated into two distinct clusters.
Different Author Text (k = 2): Texts by different authors are also separated into clusters, but with a lower silhouette score.
Increased Clusters (k = 5): Clusters lie very close to each other, with a lower silhouette score.

Dataset and Feature Extraction
Data Collection: Small dataset of 10 samples.
Data Preprocessing: Extracting text from PDF and obtaining font information.
Feature Extraction: Stylometric analysis of the extracted text data.

Future Scope
Enhancing Plagiarism Detection: Extending the work to enhance existing plagiarism detection algorithms.
Academic Settings: Useful in academic settings for detecting plagiarism between assignments by detecting style similarities.
Optimizing Methodology: Optimizing the current methodology to maintain good accuracy as the value of k increases.

Conclusion
Optimized System: The system works well for distinguishing different writing styles when k = 2.
Clustering Algorithms: The KMeans and KMeans++ algorithms provide nearly the same output.
Future Research: There are opportunities to add more parameters and to optimize the methodology for better accuracy.

Link to Research Paper
https://ieeexplore.ieee.org/document/10482055