Toxic Comments Classification Using Python



Slide Content

Preliminary Screening
Project Title: Toxic Comment Detection
Presented by:
Under the Supervision of:

Contents: Introduction, Motivation, Problem Statement, Literature Survey, Meeting Details, Workload Distribution, Project Planning, Screenshot of the Approval of the Certificate of the Project Report, Methodology Used, Solution Approach, Algorithms and Framework, Outcome Produced, Proof of the Outcome.

Introduction
Toxic texts are those that are impolite, show disrespect, or tend to drive people away from a conversation. If such texts can be identified automatically, discussions on social networks, news websites, and online forums could become healthier. Highly toxic texts lead to personal insults, online abuse, and bullying that harm a person's psychological health and emotional well-being, and many people give up on expressing themselves out of fear of online harassment and bullying. An automated system must therefore be built to identify, remove, or keep such harmful content away from online sites, but developing such a toxicity-identification system is a difficult task for online platform providers. Natural language processing lends a helping hand in identifying toxicity in content, whether expressed as text or as images, and the detection of insulting comments is a critical area of research in the field. The primary goal is to assess the toxicity and habits expressed in words and their contexts. The objective of this paper is to propose a model that detects toxic and non-toxic texts with higher accuracy.

Motivation
People refrain from expressing themselves because toxicity on social media affects their emotional and mental well-being. A system must be developed to identify such toxicity in text.

Problem Statement
To propose a model that helps users stay away from the toxic environment that exists on social media in the form of text.
To propose a model for identifying toxicity in texts, i.e., classifying them as toxic or non-toxic.
To propose a model with high accuracy.

Literature Survey

1. Keeping Children Safe Online with Limited Resources: Analyzing What Is Seen and What Is Heard
Authors: Aleksandar Jevremovic, Mladen Veinovic, Milan Cabarkapa
Objective: Designed a framework (Casper) that directly analyzes the content the user sees and hears.
Algorithms: BERT; for images, CNN, LSTM, BLSTM.
Datasets: Twitter sexism parsed, YouTube parsed, Toxicity parsed, Attack parsed.
Conclusion: Accuracy 95%; audio accuracy 91%.
Drawbacks / future work: Online grooming and self-harm detection are their future focus.

2. Text Mining and Text Analytics of Research Articles
Authors: Akshaya Udgave and Prasanna Kulkarni
Objective: To analyze the use of text-mining techniques and to explore recent developments in the field of design science.
Algorithms: Text mining, NLP.
Conclusion: In the future, different design algorithms would help resolve various issues in the text-mining field.
Drawbacks: Integration of domain information, varying granularity principles, refinement of multilingual text, and ambiguity in handling natural language are major problems and challenges that emerge during the text extraction or mining phase.

3. Multilingual Sentiment Analysis and Toxicity Detection for Text Messages in Russian
Authors: Darya Bogoradnikova, Olesia Makhnytkina, Anton Matveev, Anastasia Zakharova, Artem Akulov
Objective: Discusses an approach to sentiment analysis and emotion identification for user comments.
Algorithms / pipeline: (1) text pre-processing, (2) data augmentation, (3) sentiment analysis, (4) detection of toxic comments, (5) detection of toxic spans.
Dataset: 1,703 user reviews in Russian from two online-education platforms, Coursera and Stepik.
Conclusion: Achieved a complete solution for evaluating users' opinions about online courses.

4. Comment Toxicity Detection via a Multichannel Convolutional Bidirectional Gated Recurrent Unit
Authors: Ashok Kumar J, Abirami, Tina Esther Trueman, Erik Cambria
Objective: To check the toxicity of comments using machine-learning algorithms.
Algorithms: Natural language processing, MCBiGRU model, CNN.
Dataset: 223,549 instances with six labels (toxic, severe toxic, obscene, insult, threat, and identity hate); these labels mark an instance as toxic or non-toxic.
Conclusion: Achieves better training and testing accuracy than the existing models using only n-gram word embeddings; the proposed MCBiGRU model outperforms the existing results.
Drawbacks: not reported.

5. Detecting Islamic Radicalism Arabic Tweets Using Natural Language Processing
Authors: Khalid T. Mursi, Mohammad D. Alahmadi, Faisal S. Alsubaei, Ahmed S. Alghamdi
Objective: To automate the detection of hateful tweets using advanced machine-learning techniques and sentiment analysis, capturing the meaning of Arabic words in a proper word embedding (Word2Vec).
Algorithm: Word2Vec.
Dataset: 100,000 tweets from the last decade.
Conclusion: Determined the most frequent terminologies in the radical tweets of each year, which include some jihadist groups, countries, and individuals; this work can help law enforcement analyze and detect extremism on social media.
Drawbacks: Small dataset; the paper covers a limited range of radical keywords.

6. Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned from Fine-Tuned Embeddings
Authors: Fatima Shannaq, Bassam Hammo, Hossam Faris, Pedro A.
Objective: Detect offensive tweets using SVM and XGBoost.
Algorithms: SVM (Support Vector Machine), XGBoost.
Dataset: ArCybC dataset.
Conclusion: An intelligent prediction system for detecting offensive language in Arabic tweets is presented.
Drawbacks: The ArCybC dataset is small; effectiveness on a big dataset is yet to be measured.

7. A Framework for Hate Speech Detection Using Deep Convolutional Neural Network
Author: Pradeep Kumar Roy
Objective: To monitor users' posts and filter hate-speech posts before they spread.
Algorithm: Deep Convolutional Neural Network (DCNN).
Conclusion: 10-fold cross-validation was used with the proposed DCNN, which achieved the best prediction recall of 0.88 for hate speech and 0.99 for non-hate speech.
Drawbacks: It correctly predicts only 53% of the hate-speech tweets in the dataset because of the class imbalance (bias towards non-hate tweets); images could also be used for the same task.

8. An Assessment of Deep Learning Models and Word Embeddings for Toxicity Detection within Online Textual Comments
Authors: Danilo Dessì, Diego Reforgiato Recupero, Harald Sack
Objective: Uses multiple deep-learning models in multiple tests to check the toxicity of text.
Algorithms: Natural language processing, sentiment analysis, emotion detection; CNN, BERT, LSTM.
Dataset: Kaggle-based dataset.
Conclusion: The LSTM-based model is the first choice among the experimented models for detecting toxicity.
Drawbacks: Different word embeddings may represent domain knowledge in a variety of ways, so a single model for all cases may be insufficient; BERT embeddings underperformed.

9. Toxic Comments Detection Using LSTM
Authors: Krishna Dubey, Rahul Nair
Objective: To perform text mining and use deep-learning models that can classify, with near accuracy, whether a given text is toxic or not.
Algorithms: ML algorithms, LSTM, NLP, artificial neural network.
Conclusion: Accuracy 94%.
Drawbacks: Could have been more precise, and the ELMo model was not explored for this problem.

10. Detecting Toxic Remarks in Online Conversation
Author: Pushpit Gautam
Objective: To establish a toxicity-classification scheme for online comments based on vocabulary and other characteristics of a sentence.
Dataset: Kaggle competition multi-label Wikipedia talk-page edits dataset.
Algorithms: Naïve Bayes, Gaussian Naïve Bayes, Support Vector Machine, back-propagation neural network.
Conclusion: The label power-set method with multinomial Naïve Bayes can be used to find toxic comments with more than one type.
Drawbacks: The dataset had more than 1.5 lakh comments, so the kernel frequently crashed with errors; future work includes implementing AdaBoost in the scikit-learn library so that it can be used directly for multi-label classification problems.

11. Detect Toxic Content to Improve Online Conversations
Authors: Deepshi Mediratta, Nikhil Oswal
Objective: Train on online text to detect offensive content.
Algorithms: SVM, Naïve Bayes, GRU, LSTM.
Conclusion: GRU with GloVe embeddings provided the best result (accuracy = 89.49, F1 score = 0.72).
Drawbacks: The dataset provided is highly imbalanced; the data also contains noise and questions not classified correctly by humans.

12. Convolutional Neural Networks for Toxic Comment Classification
Author: Spiros V. Georgakopoulos
Objective: Perform text mining using CNN.
Algorithms: Convolutional neural network, word2vec.
Conclusion: CNNs can outperform well-established methodologies, providing enough evidence that their use is appropriate for toxic comment classification.
Drawbacks / future work: The promising results motivate further development of CNN-based methodologies for text mining, employing adaptive-learning methods and providing further comparisons with n-gram-based approaches.

13. Machine Learning Methods for Toxic Comment Classification: A Systematic Review
Author: Darko Andročec
Objective: Toxic comment or reply detection using machine learning.
Algorithms: RPART, SVM, and GLM.
Conclusion: Evaluated 62 classifiers representing 19 major algorithmic families against features extracted from the Jigsaw dataset of Wikipedia comments, and compared the classifiers based on statistically significant differences in accuracy and relative execution time.

14. A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution
Authors: Guizhe Song, Degen Huang, and Zhifeng Xiao
Objective: Use machine learning for toxic text detection on an imbalanced dataset.
Algorithms: XLM-RoBERTa; mBERT.
Dataset: Part of the English training corpus is divided into multiple languages.
Drawbacks: Sample-size reconstruction is required.

Workload Distribution
1. Vishwajeet Kumar: research work, coding, and documentation.
2. Ashwani Tyagi: coding and related research, product design.
3. Arpit Rao: research, testing, coding, and product review.

Project Planning

Methodology

Solution Approach: raw data → text pre-processing → feature extraction → training data / test data → classification (binary) → toxic text or non-toxic text.

RAW DATA: We first collected the dataset from Kaggle; we selected a Twitter dataset. PRE-PROCESSING: In this step we edited, cleansed, and modified the data, following the steps shown. FEATURE EXTRACTION: In this step we examined which features are present in the data before training and testing. TRAIN and TEST: We divided the dataset into two subsets, train and test. CLASSIFICATION: For classification we used Linear Regression, CNN, and LSTM; the chosen classifier labels the text as toxic or non-toxic. A minimal code sketch of this pipeline is shown below.
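The following is an illustrative Python sketch of that pipeline, not the project's actual code: the file name, the columns comment_text and toxic (0/1), and all hyperparameters are placeholder assumptions. The deck lists Linear Regression; logistic regression, the standard scikit-learn choice for binary text classification, is used here as a stand-in, with TF-IDF for feature extraction.

```python
# Sketch of the described pipeline: raw data -> pre-processing -> feature
# extraction -> train/test split -> binary classification (toxic / non-toxic).
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def clean_text(text: str) -> str:
    """Basic pre-processing: lowercase, strip URLs, @mentions, and punctuation."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)              # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)          # keep letters only
    return re.sub(r"\s+", " ", text).strip()

# RAW DATA: load the collected Kaggle/Twitter dataset (file name is a placeholder).
df = pd.read_csv("toxic_comments.csv")

# TEXT PRE-PROCESSING: clean each comment (hypothetical column "comment_text").
df["clean"] = df["comment_text"].astype(str).map(clean_text)

# FEATURE EXTRACTION: TF-IDF turns each comment into a sparse feature vector.
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(df["clean"])
y = df["toxic"]  # hypothetical binary label column

# TRAIN / TEST: split the dataset into the two subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# BINARY CLASSIFICATION: toxic vs. non-toxic.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

The same cleaned text and train/test split can also feed the CNN and LSTM models mentioned in the classification step.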

Algorithms and Framework
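Among the algorithms the deck names (CNN, LSTM), a minimal LSTM classifier in Keras could look as follows; this is a hedged sketch only, and the vocabulary size, sequence length, and layer sizes are illustrative assumptions rather than the project's actual settings.

```python
# Minimal LSTM-based binary classifier sketch (TensorFlow/Keras), assuming
# raw comment strings as input and a 0/1 toxicity label.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000  # assumed vocabulary size
MAX_LEN = 150       # assumed maximum comment length in tokens

# Maps raw strings to padded integer token sequences; must be adapted to the
# training comments before training (see the commented call below).
vectorize = layers.TextVectorization(max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LEN)
# vectorize.adapt(train_texts)

model = tf.keras.Sequential([
    vectorize,                                          # strings -> token ids
    layers.Embedding(VOCAB_SIZE, 128, mask_zero=True),  # token ids -> dense vectors
    layers.LSTM(64),                                    # sequence model over the comment
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),              # binary output: toxic / non-toxic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_texts, train_labels, validation_split=0.1, epochs=3, batch_size=64)
```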

Outcome Produced
The expected outcome of this project is a research paper, which we have submitted to IEEE Xplore.

Proof of the Outcome

Thank you