Join Priyanka as she explores the use of data science in predicting heart disease. This presentation covers the methodologies, algorithms, and data analysis techniques employed to forecast heart disease risks. Gain insights into data preprocessing, feature selection, model building, and evaluation. Discover how predictive analytics can play a crucial role in early detection and prevention of heart disease. Ideal for students and professionals interested in healthcare analytics and data science applications. For more information, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Size: 955.82 KB
Language: en
Added: Jul 31, 2024
Slides: 26 pages
Slide Content
Heart Disease Prediction
Abstract This project focuses on predicting heart disease using various machine learning models. The goal is to evaluate the performance of different algorithms and identify the most accurate model for predicting the presence of heart disease. The dataset used for this study is publicly available and contains various medical features that are indicative of heart disease.
Table of Contents Introduction Objective Methodology Data Preprocessing Exploratory Data Analysis Feature Selection Model Training and Evaluation Results and Discussion Conclusion
Introduction Heart disease is a leading cause of death worldwide. Early detection and intervention can significantly improve patient outcomes. This project aims to leverage machine learning techniques to predict the presence of heart disease based on medical data. The objectives of this project are to compare the performance of various machine learning models and identify the best-performing model.
Objective The field of heart disease prediction using machine learning has seen significant advancements with the application of various algorithms, datasets, and techniques. The common approaches, including supervised learning, feature selection, and data preprocessing, play crucial roles in enhancing model performance. Ensemble methods and deep learning models have shown great promise in achieving high accuracy. Future research should focus on improving data quality, incorporating more features, and making models interpretable for practical clinical use.
Methodology The methodology section describes the overall approach taken to achieve the project objectives. This includes data collection, preprocessing, model selection, training, evaluation, and comparison.
Data Preprocessing The dataset used in this project is sourced from a publicly available heart disease dataset. Data preprocessing steps include handling missing values, removing duplicate entries, and scaling features.
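The preprocessing steps above (dropping duplicates, imputing missing values, scaling) can be sketched as follows. This is a minimal illustration assuming a pandas/scikit-learn workflow; the toy frame and its column names stand in for the actual heart disease dataset and are not taken from the slides.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the heart disease dataset
# (column names and values are hypothetical).
df = pd.DataFrame({
    "age":    [63, 37, 37, np.nan, 57],
    "chol":   [233, 250, 250, 204, 354],
    "target": [1, 1, 1, 1, 0],
})

df = df.drop_duplicates()                      # remove duplicate rows
df = df.fillna(df.median(numeric_only=True))   # impute missing values with the column median
features = df.drop(columns="target")
scaled = StandardScaler().fit_transform(features)  # zero mean, unit variance per column
print(scaled.shape)
```

Median imputation and standard scaling are one common choice here; the slides do not specify which imputation strategy or scaler the project used.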
Exploratory Data Analysis
Box Plots of the Numeric Columns
These are the box plots after removing the duplicates.
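A box plot of the numeric columns, as used in the EDA slides, can be produced with pandas and matplotlib. This is a sketch on synthetic data; the column names and distributions are illustrative, not the project's.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric columns standing in for the dataset's features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age":  rng.normal(54, 9, 100),
    "chol": rng.normal(246, 52, 100),
})

ax = df.plot(kind="box")            # one box per numeric column
ax.set_title("Box Plots of the Numeric Columns")
ax.figure.savefig("boxplots.png")
```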
Feature Selection Feature selection involves choosing relevant features for model training. In this project, all features except the target variable are used for training.
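The project keeps all features except the target, but a common feature selection approach the Objective slide alludes to is univariate scoring. A minimal sketch on synthetic data (the dataset and parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 10 features, of which 4 are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Keep the 4 features with the highest ANOVA F-scores.
selector = SelectKBest(f_classif, k=4).fit(X, y)
X_sel = selector.transform(X)
print(X_sel.shape)
```

Using all features, as this project does, is equivalent to skipping this step entirely.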
Model Training and Evaluation Multiple machine learning models are trained and evaluated on the dataset. The models used include Logistic Regression, Decision Tree, Random Forest, SVM, KNN, Gradient Boosting, AdaBoost, Naive Bayes, and MLP Neural Network.
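Training several classifiers and comparing their test accuracy, as described above, might look like the following sketch. Synthetic data stands in for the heart disease dataset, and only a subset of the listed models is shown; hyperparameters are scikit-learn defaults, not the project's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 13 features (typical of heart disease datasets).
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

# Fit each model and record its held-out test accuracy.
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

The same loop extends naturally to the remaining models the slides list (SVM, KNN, Gradient Boosting, Naive Bayes, MLP).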
Training of model without duplicates
Training of model with duplicates
Results and Discussion Model performance without duplicates: the Decision Tree model achieved the highest testing accuracy at 90.32%, with precision 0.91, recall 0.90, and F1-score 0.91, showing reasonable performance but some variability. The AdaBoost model reached a testing accuracy of 83.87% with precision 0.94, recall 0.93, and F1-score 0.93, indicating stronger generalization. Model performance with duplicates: the Random Forest model achieved near-perfect scores, with precision 0.99, recall 0.99, F1-score 0.99, and a testing accuracy of 98.05%, indicating overfitting. The Decision Tree model also showed inflated metrics: precision 0.95, recall 0.94, F1-score 0.94, and a testing accuracy of 97.08%.
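The precision, recall, F1, and accuracy figures reported above are standard classification metrics. A minimal sketch of how they are computed with scikit-learn, on hypothetical labels (not the project's predictions):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4/5
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 4/5
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
acc = accuracy_score(y_true, y_pred)    # correct / total = 6/8
print(prec, rec, f1, acc)
```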
Future Improvements: Future work could involve hyperparameter tuning of the best-performing models to further improve accuracy, as well as developing a real-time prediction system for practical deployment.
Conclusion In this heart disease prediction project, models such as Random Forest and Decision Tree performed exceptionally well when duplicates were retained, achieving high precision, recall, and F1-scores. Given the sensitivity and critical nature of healthcare data, maintaining duplicates may be necessary to preserve valuable information. The inflated performance metrics indicate that these models can accurately predict heart disease when all data points are considered. Thus, despite the risk of overfitting, keeping duplicates ensures that the models leverage all available data, leading to better predictive accuracy in this healthcare context. This approach highlights the balance between data preprocessing and maintaining data integrity in sensitive applications.