A Tool for Detecting Source Code Plagiarism: SourcePlag
ntu727
Mar 02, 2025
About This Presentation
Paper Accepted At ICBDS-2024.
Slide Content
SourcePlag: A Source Code Plagiarism Detector
2024 IEEE International Conference on Blockchain and Distributed System (IEEE ICBDS 2024)
Nakul Sharma, Siddharth Shinde, Swarup Bhosale, Suyog Patil
Vishwakarma Institute of Information Technology
Abstract
Code plagiarism poses a significant challenge in programming communities, necessitating effective detection mechanisms. This paper introduces a system that employs Abstract Syntax Trees (ASTs) for code representation and comparison. ASTs capture the structural essence of code, which supports a comprehensive analysis of code similarity and lets the approach be applied across multiple programming languages. The system uses the Levenshtein Distance algorithm for Python code comparison and node counting for other languages such as Java and C/C++. By integrating AST-based representation with a combination of comparison techniques, the system aims to identify plagiarized code accurately across various programming environments. Through a detailed exploration of the system's methodology, this paper underscores its potential to address the pervasive issue of code plagiarism in programming communities.
Introduction
Importance of Academic Integrity: Ensuring originality in software is crucial in both academic and professional environments.
Limitations of Traditional Methods: Basic textual comparison is often insufficient for detecting source code plagiarism, especially when the code has been modified by renaming variables or altering formatting.
Role of Abstract Syntax Trees (ASTs): ASTs provide a structural representation of the code, capturing its structure beyond mere text.
Plagiarism Detection Using ASTs: AST-based analysis helps detect plagiarism attempts even when obfuscation techniques are used, making it more reliable.
Objective: This paper presents a robust method to detect source code plagiarism by leveraging ASTs, offering a more effective solution than text-based methods.
Related Work
Winnowing Algorithm Based Models: The approach splits the input into n-grams, hashes each n-gram, and selects the minimum hash in each window of consecutive hashes as a fingerprint. MOSS builds on this fingerprinting scheme, and related techniques use cosine similarity and other measures to improve detection accuracy.
Abstract Syntax Tree Based Models: AST-based models for plagiarism detection include DECKARD, which uses Euclidean distance and locality-sensitive hashing (LSH) for efficient code comparison, and Greenan's AST-based exact matching with the Smith-Waterman algorithm. Chilowicz's tool combines hashing and ASTs, using cryptographic hash functions for subtree matching. CodEx uses ASTs and hashing to measure node contributions with a weight-based depth-first search, generating similarity scores. These models improve the efficiency of plagiarism detection.
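The winnowing idea summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by MOSS or by SourcePlag: the function names, the choice of MD5 as a stable hash, the default values of `k` and `window`, and the Jaccard overlap used to compare fingerprint sets are all assumptions made for the example.

```python
import hashlib


def kgram_hashes(text, k):
    """Hash every k-character substring (MD5 is an arbitrary but stable choice)."""
    return [
        int(hashlib.md5(text[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(text) - k + 1)
    ]


def winnow(text, k=5, window=4):
    """Return the set of (hash, position) fingerprints selected by winnowing."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for start in range(max(len(hashes) - window + 1, 0)):
        win = hashes[start:start + window]
        smallest = min(win)
        # On ties, keep the rightmost minimum in the window.
        offset = window - 1 - win[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def fingerprint_similarity(a, b, k=5, window=4):
    """Jaccard overlap of fingerprint hashes (an assumed comparison measure)."""
    fa = {h for h, _ in winnow(a, k, window)}
    fb = {h for h, _ in winnow(b, k, window)}
    if not fa or not fb:
        # Texts shorter than k + window - 1 characters yield no fingerprints.
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    original = "def add(a, b): return a + b"
    edited = "def add(a, b): return b + a  # swapped"
    print(fingerprint_similarity(original, original))
    print(fingerprint_similarity(original, edited))
```

Because only one hash per window is kept, the fingerprint set is much smaller than the full list of n-gram hashes, while any sufficiently long shared substring is still guaranteed to contribute a common fingerprint.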
Methodology Overview for Source Code Plagiarism Detection
Input: The system accepts source code files written in Python, Java, or C++.
Preprocessing: Comments and unnecessary whitespace are removed from the source code to standardize the input across languages.
AST Generation: The Abstract Syntax Tree (AST) is generated from the preprocessed code to convert it into a structured form for further analysis.
Similarity Analysis: The system takes one of two approaches depending on the programming language.
For Python: It applies the Levenshtein Distance algorithm to calculate the similarity between code sequences.
For Java/C++: It uses Node Counting within the AST to assess structural similarity.
Similarity Score & Report Generation: The results from the Python and Java/C++ analyses are combined to generate a final similarity score and plagiarism report.
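The Python branch of this pipeline can be sketched as follows. The slides do not say which sequence the Levenshtein Distance is computed over, so this example assumes it is the pre-order sequence of AST node type names, and it assumes a score normalized by the longer sequence length. The function names are hypothetical. Python's `ast.parse` already discards comments and whitespace, so the preprocessing step is implicit here.

```python
import ast


def node_sequence(source):
    """Return AST node type names in depth-first pre-order (assumed representation)."""
    sequence = []

    def visit(node):
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return sequence


def levenshtein(a, b):
    """Edit distance between two sequences, keeping only one row of the table."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(source_a, source_b):
    """Normalized similarity in [0, 1]; the normalization is an assumption."""
    seq_a, seq_b = node_sequence(source_a), node_sequence(source_b)
    longest = max(len(seq_a), len(seq_b))
    return 1.0 - levenshtein(seq_a, seq_b) / longest


if __name__ == "__main__":
    original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    renamed = (
        "# copied and renamed\n"
        "def sum_all(values):\n    acc = 0\n    for v in values:\n"
        "        acc += v\n    return acc\n"
    )
    print(similarity(original, renamed))
    print(similarity(original, "print('hello')\n"))
```

Comparing node type names rather than raw text is what makes renamed identifiers and added comments invisible to the comparison: the two functions in the usage example produce the same node sequence.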
Results and Discussion
Experiments & Analysis: We tested the system on a dataset of source code files in Python, Java, and C++. The dataset included pairs of source code samples with known levels of similarity, ranging from identical copies to functionally similar but structurally different code. We evaluated the effectiveness of the system using Levenshtein Distance for Python and Node Counting for Java/C++.
Discussion: Interpretation of Results
The Levenshtein Distance approach for Python proved effective in detecting plagiarized code even with minor changes such as variable renaming or formatting differences. Node Counting for Java/C++ was particularly robust in identifying structural similarities, such as rearranged functions or classes, making it a good fit for these languages. Overall, the system performed well across different programming languages and code structures, demonstrating its potential to accurately detect source code plagiarism in varied scenarios.
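The robustness of node counting to rearrangement follows from the fact that a count of node types ignores order. The sketch below illustrates this under two loudly stated assumptions: it uses Python's built-in `ast` module as a stand-in, because the standard library has no Java or C++ parser, and it scores two count profiles by their multiset overlap, which is not necessarily the formula used in SourcePlag. The function names are hypothetical.

```python
import ast
from collections import Counter


def node_counts(source):
    """Count how often each AST node type occurs (order is deliberately ignored)."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def count_similarity(source_a, source_b):
    """Multiset overlap of node counts in [0, 1] (an assumed scoring formula)."""
    counts_a, counts_b = node_counts(source_a), node_counts(source_b)
    shared = sum((counts_a & counts_b).values())  # per-type minimum
    total = sum((counts_a | counts_b).values())   # per-type maximum
    return shared / total


if __name__ == "__main__":
    first = "def f(a):\n    return a + 1\n\ndef g(b):\n    return b * 2\n"
    swapped = "def g(b):\n    return b * 2\n\ndef f(a):\n    return a + 1\n"
    print(count_similarity(first, swapped))
    print(count_similarity(first, "x = 1\n"))
```

The same property is also a limitation: two unrelated programs that happen to use similar numbers of each construct would score highly, so a count-based score is best read as a coarse structural signal.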
Conclusion and Future Scope
Conclusion: An online assignment plagiarism checker is a crucial resource for preserving the integrity of education. It empowers both educators and students to uphold the values of originality and honesty. By discouraging plagiarism, it fosters a deeper engagement with learning, ensuring that academic assessments are a true reflection of students' knowledge and skills. This tool not only supports the academic community but also helps educational institutions maintain their reputation for excellence and ethical scholarship.
Future Scope: The source code plagiarism detector can be improved and developed further. In real-world scenarios, software projects often involve multiple languages or components written in different languages. By supporting multiple languages, the tool can accommodate the complexities of modern software development practices and provide practical solutions for plagiarism detection in heterogeneous codebases.