Diabetes prediction using Machine Learning and Data Preprocessing techniques
About This Presentation
BITS Hyderabad Summer Term Course Project final presentation, which surveyed and implemented papers on diabetes prediction using ML techniques.
Slide Content
Improving Diabetes Prediction Accuracy using Machine Learning & Data Analysis Techniques Arjun Srivastava 2021B2A42490P Priyansh Patel 2021A7PS2682P BITS Pilani Hyderabad Campus BITS F464 Machine Learning Grp 12 Research Project Presentation
Importance of Solving Late Detection of Diabetes
Diabetes complications: major complications include cardiovascular disease, peripheral arterial disease, diabetic retinopathy, diabetic nephropathy, and diabetic neuropathy. Early detection is crucial to prevent or delay the progression of these serious complications.
Machine learning for early detection: ML-based risk prediction models can analyze routine blood work to identify early biomarkers of complications. Integrating automated ML-based screening processes can increase access and reduce costs, empowering both healthcare providers and patients.
Literature Review - 1
Explores the efficacy of ensemble learning, specifically boosting techniques (XGBoost, AdaBoost, Gradient Boosting, and CatBoost), for diabetes prediction using the Pima Indian Diabetes dataset. Emphasis was placed on data preprocessing, including handling missing values, feature scaling, and addressing class imbalance with techniques such as SMOTE. Hyperparameter optimization with GridSearchCV and evaluation through K-Fold cross-validation were conducted to ensure model robustness. The paper claimed that boosting techniques significantly improved prediction accuracy, demonstrating the potential of ensemble methods in medical diagnosis applications.
Literature Review - 2
Integrates machine learning with explainable AI (XAI) techniques for diabetes prediction, emphasizing model interpretability. Models used: logistic regression, KNN, decision tree, voting, random forest, bagging, SVM, AdaBoost, XGBoost. Highlights the importance of model interpretability to enhance utility in clinical settings and build confidence among clinicians and patients. Techniques like SHAP (SHapley Additive exPlanations) were used to provide insights into feature importance and model decisions, making predictions more transparent and aligned with healthcare professionals' needs. This ensures the models are practically useful as reliable and understandable AI systems.
Implementation Of Ensemble Learning Methods
Libraries used
pandas: data manipulation
numpy: numerical operations
scikit-learn: machine learning models and tools
imblearn: handling class imbalance
matplotlib and seaborn: data visualization
xgboost, lightgbm, catboost: advanced boosting models
timeit: measuring execution times of small bits of Python code; avoids a number of common timing traps, such as including idle time in the total.
Pima Indians Diabetes Dataset Features
Pregnancies = number of times pregnant
Glucose = plasma glucose concentration
Blood pressure = diastolic blood pressure
Skin thickness = triceps skin fold thickness
Insulin = participant's insulin level
Body mass index (BMI) = body fat based on height and weight
Diabetes pedigree function (DPF) = likelihood of diabetes based on family history
Age
Outcome (class attribute) = 0 for no diabetes, 1 for diabetes
Data Preprocessing
Handling missing values:
    print("Null values in features")
    print(data.isnull().sum())
(No missing values were found, so no samples were removed or imputed.)
Splitting the dataset into features and target variable:
    # Separate features and target
    X = data.drop('Outcome', axis=1)
    y = data['Outcome']
Upsampling using SMOTE
Upsampling (oversampling) is a data processing and optimization technique that addresses class imbalance in a dataset by adding data derived from the original minority-class samples until all classes are equal in size. An imbalanced dataset is one in which a class is greatly underrepresented relative to the true population, creating unintended bias. For example, imagine a model trained to classify images as showing a cat or a dog, using a dataset composed of 90% cats and 10% dogs. Cats are overrepresented: a classifier that predicts "cat" every time yields 90% overall accuracy but 0% accuracy on dogs. Such an imbalanced dataset causes classifiers to favor accuracy on the majority class at the expense of the minority class.
Advantages
No information loss: unlike downsampling, upsampling generates new data points, avoiding any information loss.
Increases data at low cost: often the only way to increase dataset size on demand when data can only be acquired through observation; for instance, certain medical conditions are simply too rare to allow more data to be collected.
Disadvantages
Overfitting: upsampling assumes that the existing minority-class data adequately captures reality; if that is not the case, the classifier may not generalize well.
Data noise: can increase noise in the data, reducing the classifier's reliability and performance.
Computational complexity: training the classifier becomes more computationally expensive.
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import StandardScaler

    # Apply SMOTE
    X_resampled, y_resampled = SMOTE(random_state=123).fit_resample(X, y)

Normalizing feature values is crucial, as it ensures that all features contribute equally to the model's learning process. It also helps gradient-based algorithms such as gradient descent converge faster.

    # Normalize features
    X_scaled = StandardScaler().fit_transform(X_resampled)

    # Check original class distribution
    print("Original class distribution:")
    print(y.value_counts())
    0    500
    1    268

    # Check new class distribution
    print("Class distribution after SMOTE:")
    print(y_resampled.value_counts())
    1    500
    0    500
Ensemble Learning
Definition: ensemble learning combines multiple models, each trained on the same data or subsets of it, and combines their predictions to make a final prediction. Some advanced ensemble techniques are stacking, bagging, and boosting.
Basic Ensemble Methods
Max voting: generally used for classification problems. Multiple models make predictions for each data point, and each model's prediction counts as a 'vote'; the final prediction is the mode of all the votes.
Averaging: similar to max voting, but we take the average of the predictions from all models as the final prediction. Averaging can be used for regression problems or for calculating probabilities in classification problems. Another variant is weighted averaging.
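As an illustration not taken from the deck, a minimal max-voting sketch with scikit-learn's VotingClassifier; the variables X_scaled and y_resampled follow the preprocessing slides above.

    from sklearn.ensemble import VotingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier

    # Hard voting: each base model casts one vote; the final prediction is the mode
    voting = VotingClassifier(
        estimators=[
            ('lr', LogisticRegression(max_iter=1000)),
            ('knn', KNeighborsClassifier(n_neighbors=5)),
            ('rf', RandomForestClassifier(random_state=123)),
        ],
        voting='hard',  # 'soft' would average predicted probabilities instead
    )
    voting.fit(X_scaled, y_resampled)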
Gradient Boosting Machines (GBM)
GBM uses the boosting technique, combining a number of weak learners to form a strong learner. Regression trees are used as the base learner; each subsequent tree in the series is built on the errors calculated by the previous tree. Works for both regression and classification problems.
Example: the mean age is taken as the initial predicted value for all observations in the dataset. Errors (residuals) are calculated from this mean prediction and the actual ages. A tree model is then fit using these errors as the target variable, with the objective of finding the best split to minimize the error. The predictions of this tree are added to the initial predictions from step 1; the combined value becomes the new prediction, and new errors are calculated from it and the actual values. The process repeats tree by tree.
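A minimal sketch of this residual-fitting loop (our illustration, not the deck's code), using shallow regression trees as the weak learners:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def simple_gbm_fit(X, y, n_trees=50, lr=0.1):
        # Step 1: the initial prediction is the mean of the target (e.g., mean age)
        pred = np.full(len(y), y.mean(), dtype=float)
        trees = []
        for _ in range(n_trees):
            residuals = y - pred                          # errors of the current prediction
            tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
            pred += lr * tree.predict(X)                  # combine with previous predictions
            trees.append(tree)
        return y.mean(), trees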
Some Other Models
Extreme Gradient Boosting (XGBoost): features tree pruning, regularization, built-in cross-validation, and parallel processing, which makes it a preferred choice for data scientists seeking robust and accurate predictive models.
CatBoost: datasets often contain categorical features, and to fit them into a boosting model we usually apply encoding techniques such as one-hot encoding or label encoding. One-hot encoding, however, creates a sparse matrix that can sometimes lead to overfitting; CatBoost handles this issue by processing categorical features automatically.
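For illustration, and assuming the catboost package listed earlier, CatBoost accepts raw categorical columns directly via cat_features, with no one-hot encoding step (the toy data here is hypothetical):

    import pandas as pd
    from catboost import CatBoostClassifier

    df = pd.DataFrame({
        'city': ['Pilani', 'Hyderabad', 'Pilani', 'Goa'],  # categorical, kept as strings
        'age':  [23, 45, 31, 52],
        'label': [0, 1, 0, 1],
    })
    model = CatBoostClassifier(iterations=100, verbose=0)
    model.fit(df[['city', 'age']], df['label'], cat_features=['city'])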
AdaBoost
Adaptive boosting, or AdaBoost, is one of the simplest boosting algorithms. Decision trees are usually used for modelling. Multiple sequential models are created, each correcting the errors of the last. AdaBoost assigns weights to the observations that are incorrectly predicted, and the subsequent model works to predict these values correctly. Hyperparameters to tune: n_estimators, learning_rate, and max_depth (a sketch follows this list).
Steps:
1. Initially, all observations in the dataset are given equal weights.
2. A model is built on a subset of the data and predictions are made on the whole dataset.
3. Errors are calculated by comparing the predictions with the actual values, and weights are determined from the error.
4. When creating the next model, higher weights are given to the data points that were predicted incorrectly.
5. This is repeated until the error stops changing or the maximum number of estimators is reached.
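An illustrative AdaBoost setup exposing the three tuning knobs named above (n_estimators, learning_rate, and the base tree's max_depth); the values are placeholders, not the deck's final choices:

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # weak learner; 'base_estimator' in scikit-learn < 1.2
        n_estimators=100,
        learning_rate=0.5,
        random_state=123,
    )
    ada.fit(X_scaled, y_resampled)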
Hyperparameter Tuning Used
Grid search: specifies a grid of hyperparameter values and exhaustively tries every combination of those values.
Random search: selects random combinations of hyperparameter values from specified distributions.
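A sketch of both searches applied to the AdaBoost model (ada) from the previous snippet; the grids and distributions are illustrative, not the deck's actual search space:

    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from scipy.stats import randint, uniform

    # Exhaustive grid search over every combination
    param_grid = {'n_estimators': [50, 100, 200], 'learning_rate': [0.1, 0.5, 1.0]}
    grid = GridSearchCV(ada, param_grid, cv=5, scoring='accuracy')
    grid.fit(X_scaled, y_resampled)

    # Random search samples combinations from distributions
    param_dist = {'n_estimators': randint(50, 300), 'learning_rate': uniform(0.01, 1.0)}
    rand = RandomizedSearchCV(ada, param_dist, n_iter=20, cv=5, random_state=123)
    rand.fit(X_scaled, y_resampled)
    print(grid.best_params_, rand.best_params_)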
Key Findings, Comparative Conclusions and Limitations
According to the mean accuracy reported by stratified 10-fold cross-validation, Gradient Boosting performed best in our implementation, with a score of 0.813 versus the 0.968 reported in the paper. Although our score was lower than the paper's, the paper also found Gradient Boosting to be the best model. The paper further reported Gradient Boosting as best on all other metrics, whereas our implementation differed: AdaBoost achieved the best precision, followed by Light Gradient Boosting and then Gradient Boosting, and the same pattern holds for recall and F1-score.
Dataset limitations: the dataset used in the paper is relatively small, which could affect the generalizability and robustness of the reported results. Both our study and the original paper relied on the Pima Indians Diabetes Dataset, whose demographic and geographical limitations may restrict the models' applicability to broader populations.
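For reference, the stratified 10-fold mean accuracy quoted above can be computed along these lines (a sketch, not our exact evaluation code):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
    scores = cross_val_score(GradientBoostingClassifier(random_state=123),
                             X_scaled, y_resampled, cv=cv, scoring='accuracy')
    print(f"Mean 10-fold accuracy: {scores.mean():.3f}")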
Lack of detailed hyperparameter tuning: the paper did not extensively cover hyperparameter tuning, which is critical for optimizing model performance.
Significant differences in performance metrics: there is a noticeable gap between our implementation's metrics and those reported in the paper, indicating that implementation details and data-processing steps were not fully disclosed, leading to discrepancies in the results. Further work is needed on improving model interpretability and on validating the results on larger, more diverse datasets to enhance the generalizability of the findings.
Improvement
Methods Used
Dataset addition and analysis
Outlier removal
Adaptive Synthetic Sampling (ADASYN)
Hyperparameter tuning techniques
Additional models used to improve accuracy
Mutual Information (MI)
Definition: mutual information measures the amount of information that one random variable (in this case, an input feature) contains about another random variable (the output or target variable). It is based on the concept of entropy. We dropped DPF due to its zero MI with the output.
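A sketch of how this MI screening can be done with scikit-learn; the zero-MI drop rule follows the slide, while the exact call is our assumption:

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif

    mi = pd.Series(mutual_info_classif(X, y, random_state=123), index=X.columns)
    print(mi.sort_values(ascending=False))
    X_selected = X.drop(columns=mi[mi == 0].index)  # drops features with zero MI, e.g. DPF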
PCA: Why Not?
What is PCA: Principal Component Analysis (PCA) reveals the relationships and correlations between attributes in a dataset. If two attributes are very highly correlated, PCA can combine them into a single component, reducing the dimensionality of the dataset.
Why we did not use it: our dataset has only 8 attributes, while PCA is generally applied to reduce computational cost when there are tens or hundreds of attributes. PCA also becomes more advantageous as correlations approach or exceed stronger thresholds (e.g., 0.7).
Data Set Expansion
Objective: expand the PIMA dataset by incorporating data from the RTML database, a Bangladeshi dataset.
Dataset comparison: the RTML dataset lacked the "Insulin" attribute present in the PIMA dataset.
Data synthesis: used XGBoost trained on the PIMA dataset to synthesize the "Insulin" column in the RTML dataset. XGBoost was chosen for its ability to generate accurate predictions and fill in missing values effectively.
Integration: merged the RTML dataset (now containing the "Insulin" attribute) with the PIMA dataset.
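A hedged sketch of the Insulin synthesis step, assuming the two datasets' shared columns are already aligned; pima and rtml are hypothetical DataFrame names:

    import pandas as pd
    from xgboost import XGBRegressor

    shared = [c for c in pima.columns if c not in ('Insulin', 'Outcome')]

    # Train a regressor on PIMA, where Insulin is known
    reg = XGBRegressor(n_estimators=200, random_state=123)
    reg.fit(pima[shared], pima['Insulin'])

    # Predict the missing Insulin column for RTML, then merge the datasets
    rtml['Insulin'] = reg.predict(rtml[shared])
    combined = pd.concat([pima, rtml], ignore_index=True)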
Outlier Removal
Outliers were not removed during the implementation step; for the improvement, we used the IQR method with threshold k = 1.5 to remove them.
What is IQR: the IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data:
IQR = Q3 - Q1
lower_bound = Q1 - k * IQR
upper_bound = Q3 + k * IQR
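A pandas sketch of this filter (our illustration): it drops any row with a value outside the bounds in any feature column.

    def remove_outliers_iqr(df, k=1.5):
        q1, q3 = df.quantile(0.25), df.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        keep = ((df >= lower) & (df <= upper)).all(axis=1)  # True only if every column is in bounds
        return df[keep]

    features = data.drop(columns='Outcome')
    data_clean = remove_outliers_iqr(features).join(data['Outcome'])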
ADASYN
We tried ADASYN instead of SMOTE for upsampling. The use of ADASYN improved the K-fold cross-validation result for Gradient Boosting from 81.899% to 83.192%; however, it had a negative impact on the performance of all the other models.
Problems with SMOTE
SMOTE counters the overfitting problem of random oversampling by adding previously unseen data to the dataset rather than simply duplicating pre-existing data. On the other hand, SMOTE's artificial data-point generation adds extra noise to the dataset, potentially making the classifier more unstable. The synthetic points and noise can also inadvertently create overlaps between the minority and majority classes that do not reflect reality, leading to what is called over-generalization. The Adaptive Synthetic Sampling approach (ADASYN) is similar to Borderline-SMOTE in that it generates harder examples for the model to learn, but it also aims to preserve the distribution of the minority-class data.
SMOTE vs ADASYN
What is ADASYN? SMOTE creates synthetic examples by joining minority-class instances with line segments in feature space. ADASYN adds a level of adaptability by introducing some noise or perturbation around the line segments joining minority instances.
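With imblearn, swapping ADASYN in for SMOTE is a one-line change (variable names follow the earlier slides):

    from imblearn.over_sampling import ADASYN

    # ADASYN generates more synthetic points in regions where the minority class is harder to learn
    X_resampled, y_resampled = ADASYN(random_state=123).fit_resample(X, y)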
Hyperparameter Tuning Again
We expanded the range of hyperparameters considered and used Bayesian optimization in addition to grid and random search. This approach yielded better cross-validation results than the hyperparameter set originally proposed in the paper. Each model was then trained with the best hyperparameters obtained from Grid Search CV, Random Search CV, and Bayesian optimization.
Bayesian Optimization
Instead of trying random combinations, Bayesian optimization uses previous results to decide where to search next. It builds a probabilistic model of the objective function (how well the model performs with different hyperparameter values) and uses this model to make intelligent decisions about which hyperparameters to try next.
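One way to run such a search is scikit-optimize's BayesSearchCV; the deck does not name its library, so this choice and the search space are our assumptions:

    from skopt import BayesSearchCV
    from skopt.space import Integer, Real
    from sklearn.ensemble import GradientBoostingClassifier

    search = BayesSearchCV(
        GradientBoostingClassifier(random_state=123),
        {
            'n_estimators': Integer(50, 500),
            'learning_rate': Real(0.01, 0.5, prior='log-uniform'),
            'max_depth': Integer(2, 8),
        },
        n_iter=30,  # each iteration picks the next point using the surrogate model
        cv=5,
        random_state=123,
    )
    search.fit(X_scaled, y_resampled)
    print(search.best_params_)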
Improvement Chart
Improvement after: DPF drop, dataset expansion, outlier removal, use of ADASYN in GB, and hyperparameter tuning using (i) grid search, (ii) random search, and (iii) Bayesian search.
More Models - Stacking
Base models: stacking starts with a set of diverse base models, which can be any machine learning models. Each base model is trained independently on the training data and learns to make predictions.
Meta-model training: the meta-model combines these predictions to produce a final prediction that ideally performs better than each individual base model.
Stacking architecture used: Random Forest and SVM as base models, logistic regression as the meta-model.
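A scikit-learn sketch of this exact stack (Random Forest and SVM base models, logistic regression meta-model); the hyperparameter values are illustrative:

    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    stack = StackingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(random_state=123)),
            ('svm', SVC(kernel='rbf', C=1.0, probability=True)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),  # meta-model trained on base predictions
        cv=5,  # out-of-fold predictions are used to train the meta-model
    )
    stack.fit(X_scaled, y_resampled)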
SVM and Random Forest
SVM: we employed both grid search and Bayesian optimization to identify the optimal hyperparameters (kernel and regularization parameter C) for the model.
Random Forest: an ensemble learning method that builds multiple decision trees during training and combines their predictions to improve accuracy and robustness over a single tree. We employed random search and Bayesian optimization to identify its optimal hyperparameters.
Model Comparison
Conclusion
Data preprocessing: combining the PIMA and RTML diabetes datasets, along with thorough data preprocessing (including imputation of missing values and outlier removal), significantly improved model performance.
Handling imbalanced data: Adaptive Synthetic Sampling (ADASYN) and SMOTE effectively addressed the issue of imbalanced data.
Hyperparameter tuning: different tuning techniques work differently on different models; random search and Bayesian optimization were notably effective in enhancing model performance.
Ensemble learning: ensemble methods, particularly boosting and stacking, demonstrated superior predictive performance. The Random Forest model, optimized with both random and Bayesian hyperparameter tuning, achieved the highest K-fold accuracy.
Future Scope
Improving model interpretability: enhancing the interpretability of complex models such as boosting techniques is critical, especially in clinical settings where understanding model decisions is crucial for trust and adoption.
Validating on diverse datasets: future work should focus on validating the models on larger and more diverse datasets to ensure the generalizability and robustness of the findings.
Advanced hyperparameter optimization: further exploration of tuning techniques, including automated and adaptive methods, could yield additional improvements in model performance.
Integration with clinical systems: develop systems that combine predictive models with real-time clinical decision support to improve diabetes management.