Comparison of classification models using diabetes data - yogi.pptx

swethamurugan8113 44 views 23 slides Jun 21, 2024

About This Presentation

Comparison of classification models


Slide Content

Comparison of classification models using diabetes data
By: Yogipriyadharshan S, Final Year, AI&DS, KVCET
Supervisor: Dr. R. Delshi Howsalya Devi, HOD, AI&DS, KVCET

OUTLINE:
- Abstract
- Objectives
- Introduction
- Literature review
- Proposed system
- Workflow diagram
- Modules explanation
- Output screenshots
- Conclusion
- References

ABSTRACT: Comparing classification models means evaluating their performance against a common set of criteria. This study compares four classification models on a diabetes dataset: random forest, decision tree, support vector machine, and logistic regression. The key performance metrics are accuracy, precision, recall, and F1 score. The analysis also examines each model's confusion matrix and uses ROC curves and AUC to assess how well the models distinguish diabetic from non-diabetic patients. Interpretability, computational efficiency, robustness across datasets, handling of imbalanced data, scalability, and feature importance are further considerations. The study aims to provide insight into selecting the most effective classification model for diabetes-related tasks, taking into account the characteristics of the data and the relative priority of accuracy and interpretability.

OBJECTIVES:
- To perform statistical tests to determine whether the performance differences between models are statistically significant.
- To choose the most suitable model for a specific task based on its performance and its fit to the problem at hand.
- To support healthcare professionals, who can use classification models to aid the diagnosis of diseases or medical conditions; comparing models helps ensure accurate and early detection of illness.
- To suggest potential areas for future work, such as exploring advanced modeling techniques or improving data quality.
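The first objective can be illustrated with a paired t-test on per-fold accuracies. This is a minimal sketch, not the test used in the study (the slides do not name one): it assumes scikit-learn and SciPy are available, and it uses synthetic data from `make_classification` as a stand-in for the diabetes dataset.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the diabetes data (8 features, binary label); an assumption.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)

# A fixed random_state gives both models the same folds, so the scores are paired.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

rf_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=cv, scoring="accuracy"
)
lr_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# Paired t-test on the fold-wise accuracy differences.
result = ttest_rel(rf_scores, lr_scores)
t_stat, p_value = result.statistic, result.pvalue
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

Cross-validation folds share training data, so the fold scores are not fully independent and the p-value should be read as approximate.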

INTRODUCTION: Diabetes mellitus is a widespread chronic metabolic disorder that poses significant challenges globally. As its prevalence increases, accurate classification models become more important. This project explores four classification models (random forest, decision tree, support vector machine, and logistic regression) using a diabetes dataset. The study aims to assess how well each model predicts whether a patient has diabetes, and to provide insight into selecting the most effective classification model for diabetes-related tasks, taking into account the characteristics of the data and the relative priority of accuracy and interpretability.

REVIEW OF LITERATURE:

Title/Journal | Year/Author | Techniques/Algorithm | Remarks
------------- | ----------- | -------------------- | -------
Diabetes prediction using machine learning techniques | 2020 / Mothusi Soni, Sunita Varma | SVM, DT, KNN, random forest, logistic regression, and gradient boosting | Random forest achieved higher accuracy than the other machine learning techniques.
Diabetes prediction using machine learning techniques | 2018 / Tejas N. Joshi, Prof. Pramila M. Chawan | SVM, logistic regression, and ANN | Using 998 records, logistic regression gave 81% accuracy.
A comparative approach for Pima Indians diabetes diagnosis using LDA | 2014 / Parashar A., Burse K., Rawat K. | Support vector machine and feed-forward neural network | Using 768 records, it gave 77% accuracy.

PROPOSED SYSTEM:
- Implement robust data preprocessing steps to handle missing values, outliers, and feature scaling. A standardized preprocessing pipeline makes the models easier to compare.
- Employ cross-validation techniques, such as k-fold cross-validation, to ensure reliable performance evaluation across different subsets of the dataset.
- Use a comprehensive set of evaluation metrics: accuracy, precision, recall, F1 score, and area under the ROC curve (AUC-ROC).
- Integrate visualization tools such as Matplotlib, Seaborn, or other suitable libraries to create clear and informative visualizations.
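The steps above can be sketched as one shared preprocessing pipeline evaluated with stratified k-fold cross-validation. This is an illustrative sketch under assumptions, not the project's actual code: the data are synthetic (`make_classification`), and the imputation strategy, fold count, and hyperparameters are choices made here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data; an assumption for this sketch.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support vector machine": SVC(kernel="rbf"),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = {}
for name, model in models.items():
    # Identical preprocessing for every model keeps the comparison fair.
    pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), model)
    scores = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)
    results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

for name, metrics in results.items():
    print(name, {metric: round(float(value), 3) for metric, value in metrics.items()})
```

Fitting the imputer and scaler inside the pipeline means they are re-fitted on each training fold, which avoids leaking information from the held-out fold.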

Workflow diagram

MODULES:
- Input module: Diabetes dataset
- Processing module: Python programming language
- Classification module: Machine learning algorithms (support vector machine, decision tree, logistic regression, random forest)
- Output module: Matplotlib, Seaborn

MODULES EXPLANATION:

Diabetes dataset: The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases and was obtained from Kaggle. The objective is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the dataset.

Python programming: Python is a versatile programming language widely used for data analysis and visualization. Libraries such as Matplotlib, Seaborn, and Plotly make it straightforward to create insightful visual representations of data. Its simplicity and extensive community support make Python a preferred choice for generating a wide range of visualizations, from basic plots to intricate charts, which aids data analysis and communication.
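For the dataset described above, one common preprocessing step can be sketched as follows. It assumes the Kaggle Pima Indians Diabetes file, in which a zero in columns such as `Glucose`, `BloodPressure`, or `BMI` is generally treated as a missing measurement. The slides do not show how missing values were handled, so the rows below are hypothetical and median imputation is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using a subset of the Kaggle column names; not real patient data.
df = pd.DataFrame(
    {
        "Glucose": [148, 85, 0, 89],
        "BloodPressure": [72, 66, 64, 0],
        "BMI": [33.6, 26.6, 23.3, 0.0],
        "Outcome": [1, 0, 1, 0],
    }
)

# Columns where a zero is physiologically implausible and is treated as missing.
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
missing_counts = df[zero_as_missing].isna().sum()

# Fill each missing value with the median of its own column.
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())

print(missing_counts)
print(df)
```

In a real evaluation the medians should be computed on the training split only, for example with `SimpleImputer` inside a scikit-learn pipeline.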

Machine learning: Machine learning plays a central role in the analysis of diabetes data. Using classification algorithms, the project aims to develop models that can accurately predict whether a patient has diabetes. Logistic regression, decision trees, support vector machines, and random forests are used to explore patterns within the dataset.

Support vector machine: The support vector machine (SVM) is a classification algorithm commonly used in diabetes data analysis. It is one of the models compared in this study. The goal is to determine how well SVM distinguishes between individuals with and without diabetes based on the available features.
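How well an SVM separates the two classes can be measured with a ROC curve and its AUC. This is a minimal sketch assuming scikit-learn and synthetic stand-in data; the RBF kernel and the 75/25 split are choices made for the example, not settings taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data; an assumption for this sketch.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVMs are sensitive to feature scale, so standardize before fitting.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)

# Signed distance to the decision boundary serves as the ranking score.
decision_scores = svm.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, decision_scores)
auc = roc_auc_score(y_test, decision_scores)
print(f"AUC = {auc:.3f}")
```

Using `decision_function` avoids the extra cost of `probability=True`, since ROC analysis only needs a ranking of the test samples.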

Decision tree: The decision tree algorithm builds a tree structure in which each internal node tests a feature and each leaf assigns a class. Decision trees add interpretability to the comparison, offering insight into the factors that influence diabetes classification.

Logistic regression: Logistic regression is a binary classification method commonly used in diabetes data analysis and serves as one of the models compared here. Evaluation metrics such as accuracy, precision, recall, F1 score, and AUC-ROC are used to assess model performance.
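The threshold-based metrics named above all follow from the four cells of a binary confusion matrix. The sketch below shows the arithmetic in plain Python; the counts in the usage example are made up for illustration and are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only: 40 true positives, 10 false positives,
# 20 false negatives, 80 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=80)
print({name: round(value, 3) for name, value in metrics.items()})
# prints {'accuracy': 0.8, 'precision': 0.8, 'recall': 0.667, 'f1': 0.727}
```

In a screening setting, recall is often watched closely because a false negative corresponds to a missed diabetic patient.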

Random forest: Random forest is an ensemble learning algorithm. It trains multiple decision trees and combines their predictions, which reduces overfitting and improves accuracy compared with a single tree.

Matplotlib and Seaborn: Matplotlib is a popular Python library for creating static, animated, and interactive visualizations in a variety of formats. It provides a flexible and comprehensive set of plotting tools for producing high-quality plots, charts, and figures. Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level interface for creating attractive and informative statistical graphics, and simplifies complex visualizations by offering several built-in themes and color palettes.
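A model-accuracy comparison chart can be drawn with Matplotlib along the following lines. This is a sketch only: `plot_accuracy` and the output file name are hypothetical, and the scores in the usage example are placeholders, not the accuracies reported in this presentation.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_accuracy(scores, path="model_accuracy.png"):
    """Draw a bar chart of accuracy per model and save it to `path`."""
    names = list(scores)
    values = [scores[name] for name in names]

    fig, ax = plt.subplots(figsize=(7, 4))
    bars = ax.bar(names, values)
    ax.set_ylim(0, 1)
    ax.set_ylabel("Accuracy")
    ax.set_title("Comparison of model accuracy")

    # Write each value just above its bar.
    for bar, value in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width() / 2, value + 0.01, f"{value:.2f}", ha="center")

    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path


# Placeholder values for illustration only; substitute the measured accuracies.
placeholder_scores = {
    "Logistic regression": 0.5,
    "Decision tree": 0.5,
    "Random forest": 0.5,
    "SVM": 0.5,
}
saved_path = plot_accuracy(placeholder_scores)
```

Fixing the y-axis to the range 0 to 1 keeps small differences between models from looking larger than they are.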

OUTPUT SCREENSHOTS:

Input module screenshot:

Logistic regression model:

Decision tree model output:

Random forest model output:

Support vector machine model output:

Plot of comparison of model accuracy:

CONCLUSION: In this comparison of classification models on diabetes data, random forest produced better accuracy than logistic regression, support vector machine, and decision tree. Its ensemble of trees lets it capture intricate patterns in the data, which improves its performance on diabetes classification. Although every algorithm has advantages and disadvantages, the results indicate that random forest is the best option for this particular dataset when the goal is high accuracy in classifying people as diabetic or not. This finding highlights the importance of selecting an algorithm based on the characteristics of the dataset at hand.

REFERENCES:
- Mothusi Soni, Sunita Varma 2020, 'Diabetes prediction using machine learning techniques', Int. Journal of Engineering Research and Application, vol. 9, issue 09.
- Tejas N. Joshi, Prof. Pramila M. Chawan 2018, 'Diabetes prediction using machine learning techniques', Int. Journal of Engineering Research and Application, vol. 8, issue 1 (part II), pp. 09-13.
- Parashar, A, Burse, K, & Rawat, K 2014, 'A comparative approach for Pima Indians diabetes diagnosis using LDA-support vector machine and feed forward neural network', International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4(11), pp. 378-383.
- Python Data Analytics: Data Analysis and Science Using Pandas, Matplotlib, and the Python Programming Language, 2012.

Thank you

QUERIES?