VIIT Pune, India
Abstract:
This paper proposes a system to differentiate between human-generated and AI-generated texts using stylometric analysis. The system analyzes text files and classifies writing styles by employing various clustering algorithms, such as k-means, k-means++, hierarchical, and DBSCAN. The effectiveness of these algorithms is measured using silhouette scores. The system successfully identifies distinct writing styles within documents, demonstrating its potential for plagiarism detection.
Introduction:
Stylometry, the study of linguistic and structural features in texts, is used for tasks like plagiarism detection, genre separation, and author verification. This paper leverages stylometric analysis to identify different writing styles and improve plagiarism detection methods.
Methodology:
The system comprises data collection, preprocessing, feature extraction, dimensionality reduction, clustering with machine learning models, and performance comparison using silhouette scores. Feature extraction focuses on lexical features, vocabulary richness, and readability scores. The study uses a small dataset of texts from various authors and clusters them with k-means, k-means++, hierarchical clustering, and DBSCAN.
Results:
Experiments show that the system effectively identifies writing styles, with silhouette scores indicating reasonable to strong clustering when k=2. As the number of clusters increases, the silhouette scores decrease, indicating that the clusters become less compact and less well separated. K-means and k-means++ perform similarly, while hierarchical clustering is less optimized.
Conclusion and Future Work:
The system works well for distinguishing writing styles with two clusters but becomes less accurate as the number of clusters increases. Future research could focus on adding more parameters and optimizing the methodology to improve accuracy with higher cluster values. This system can enhance existing plagiarism detection tools, especially in academic settings.
Added: May 31, 2024
Slides: 12 pages
Slide Content
An Approach to Detecting Writing Styles Based on Clustering Techniques
Group Members: Devkinandan Jagtap, Shweta Ambekar, Harshit Singh
Guided by: Dr. Nakul Sharma
Vishwakarma Institute of Information Technology, Pune
Introduction:
Our methodology introduces an intelligent system for analyzing writing styles using stylometric analysis, evaluating clustering algorithms like k-means, k-means++, hierarchical, and DBSCAN. We use silhouette scores to ensure effective differentiation based on linguistic and structural features.
Stylometric Analysis:
Stylometric analysis is the study of linguistic and structural features in text to identify patterns unique to individual authors or groups of authors.
Lexical Features: Measuring word usage patterns, average word length, and syllables per word.
Vocabulary Richness Features: Evaluating the complexity and diversity of a text's vocabulary.
Readability Scores: Assessing the level of difficulty or simplicity of a text.
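The three feature groups above can be illustrated with a minimal sketch. The slides do not say which concrete measures were used, so everything here is an assumption: the function names `count_syllables` and `stylometric_features` are hypothetical, the syllable counter is a rough vowel-group heuristic, vocabulary richness is approximated by the type-token ratio, and readability by the standard Flesch Reading Ease formula.

```python
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+")
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    groups = len(VOWEL_GROUPS.findall(word))
    if word.endswith("e") and not word.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)


def stylometric_features(text):
    """Return a small dictionary of lexical, richness, and readability features."""
    words = WORD.findall(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    n_words = len(words)
    syllables_per_word = sum(count_syllables(w) for w in words) / n_words
    words_per_sentence = n_words / len(sentences)

    return {
        # Lexical features
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "syllables_per_word": syllables_per_word,
        # Vocabulary richness: distinct words divided by total words
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # Readability: Flesch Reading Ease (higher means easier to read)
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
    }
```

Each text then becomes one feature vector (the dictionary values in a fixed order), which is the input a clustering algorithm would work on.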
Clustering Algorithms:
Clustering is a machine learning technique that groups similar data points together based on certain features or characteristics.
KMeans: Popular clustering method that separates data into K clusters based on similarities between them.
KMeans++: An improvement over KMeans in terms of selecting initial centroids.
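The difference between the two methods lies only in how the initial centroids are picked. The sketch below is an illustration in plain Python, not the implementation used in the study; the function names, the `init` parameter, and the fixed `seed` are assumptions made for the example. Plain KMeans draws the starting centroids uniformly at random, while KMeans++ picks each new centroid with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """KMeans++ seeding: spread the initial centroids out."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:  # every point coincides with a centroid
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids


def kmeans(points, k, init="k-means++", max_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    if init == "k-means++":
        centroids = kmeans_pp_init(points, k, rng)
    else:  # plain KMeans: uniform random initial centroids
        centroids = [list(p) for p in rng.sample(points, k)]

    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

On well-separated data both initializations tend to converge to the same partition, which is consistent with the observation that KMeans and KMeans++ give nearly the same output; the seeding matters more when clusters overlap or k is larger.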
Clustering Algorithms (Continued):
Hierarchical Clustering: Creates a tree-like hierarchy of clusters.
DBSCAN: Density-based algorithm that identifies clusters based on data point density within a specified radius.
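To make the density idea concrete, here is a minimal DBSCAN sketch in plain Python. It is an illustration under assumptions rather than the study's code: the parameter names `eps` (the radius) and `min_samples` (the density threshold) follow common convention, a point counts as its own neighbor, and the brute-force neighbor search is only suitable for small datasets.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Brute-force neighborhoods; each point is included in its own neighborhood
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        # Point i is a core point: start a new cluster and expand it
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is also core, keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike KMeans, the number of clusters is not given in advance; it follows from `eps` and `min_samples`, and isolated points are reported as noise instead of being forced into a cluster.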
Performance Metrics:
Silhouette Score: Quantifies both cohesion and separation by comparing how close each data point is to the other points in its own cluster with how close it is to the points of the nearest other cluster. It ranges from -1 to 1, where higher values indicate more compact and better-separated clusters.
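As a sketch of the standard definition: for each point, let a be its mean distance to the other points in its own cluster and b the smallest mean distance to the points of any other cluster; the point's silhouette is (b - a) / max(a, b), and the score is the average over all points. The code below is a plain-Python illustration that assumes Euclidean distance and uses a hypothetical function name; points in single-member clusters contribute 0 by convention.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: silhouette defined as 0
        # a: mean distance to the other members of the same cluster
        # (the sum includes the point itself at distance 0)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

A value near 1 means points sit much closer to their own cluster than to any other, a value near 0 means clusters touch or overlap, and negative values suggest points assigned to the wrong cluster.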
Experiment Results:
Two Different Styles (k = 2): Stories by one author and a poem by another author are separated into two distinct clusters.
Different Author Text (k = 2): The texts are also separated into clusters, but with a lower silhouette value.
Increased Clusters (k = 5): The clusters are very close to each other, with a lower silhouette score.
Dataset and Feature Extraction:
Data Collection: Small dataset of 10 samples
Data Preprocessing: Extracting text from PDF and obtaining font information
Feature Extraction: Stylometry analysis on extracted text data
Future Scope:
Enhancing Plagiarism Detection: Extending the work to enhance existing plagiarism detection algorithms.
Academic Settings: Useful in academic settings for detecting plagiarism between assignments by detecting style similarities.
Optimizing Methodology: Optimizing the current methodology for good accuracy as the value of k increases.
Conclusion:
Optimized System: Works properly for determining different writing styles when k = 2.
Clustering Algorithms: KMeans and KMeans++ algorithms provide nearly the same output.
Future Research: Opportunities to add more parameters and optimize the methodology for good accuracy.
Link of Research Paper: https://ieeexplore.ieee.org/document/10482055