This project aimed to create a good-performing classification model for the diagnosis of breast cancer for the Wisconsin Diagnostic Breast Cancer dataset using a minimum number of features. This was investigated by finding the strongest predictors and their contributions to the prediction using decision-tree-based models.
The Curious Case of the Wisconsin Diagnostic Breast Cancer Dataset (WDBC)
Starring
Dataset Overview. No missing/null attribute values. 32 attributes: ID (only one image per ID), diagnosis, and 30 real-valued input features. Class distribution (diagnosis: benign or malignant): Benign 357 (62.7%), Malignant 212 (37.3%), Total 569 (100%).
1. Multicollinear features (Pearson coefficient >= 0.85): area_se, perimeter_mean, texture_worst, perimeter_worst, area_mean, concavity_mean, perimeter_se, radius_worst, concave_points_worst, area_worst, compactness_worst
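The multicollinearity screen above can be sketched as follows. This uses sklearn's built-in copy of WDBC; note that sklearn names columns like "mean radius" rather than the slide's "radius_mean".

```python
# Sketch: flag feature pairs with absolute Pearson correlation >= 0.85 on WDBC.
import pandas as pd
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data

# Absolute pairwise Pearson correlations between the 30 input features.
corr = X.corr(method="pearson").abs()

# Collect each highly correlated pair once (upper triangle only).
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] >= 0.85
]
for a, b, r in sorted(pairs, key=lambda t: -t[2]):
    print(f"{a:25s} {b:25s} r={r:.3f}")
```

Features that appear in many such pairs (the radius/perimeter/area family, for example) are the redundancy candidates the deck lists above.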
What if I told you that using only 8 out of 30 attributes would give you a decent result? The choice is yours!
The mean accuracy of Random Forest using only these 8 features was 95.575%. Hyperparameters: max_features=0.5, max_samples=228, min_samples_leaf=5, n_estimators=40. Split: 80:20.
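A minimal sketch of this setup, with the stated hyperparameters and an 80:20 split. This slide does not list which eight features the deck kept, so as a stand-in the sketch takes the eight highest-importance features from a probe forest fit on all 30; the resulting accuracy will therefore differ from the quoted 95.575%.

```python
# Sketch of the slide's reduced-feature model (feature choice is a stand-in).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stand-in for the deck's 8 features: top 8 by Gini importance.
probe = RandomForestClassifier(n_estimators=40, random_state=0).fit(X_tr, y_tr)
top8 = X.columns[np.argsort(probe.feature_importances_)[-8:]]

# Refit with the slide's hyperparameters on only those 8 columns.
rf = RandomForestClassifier(
    n_estimators=40, max_features=0.5, max_samples=228,
    min_samples_leaf=5, random_state=0,
).fit(X_tr[top8], y_tr)
acc = rf.score(X_te[top8], y_te)
print(f"test accuracy with 8 features: {acc:.3f}")
```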
Feature selection can reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. When interpreting the data, however, this can lead to the incorrect conclusion that one variable in a correlated group is a strong predictor while the others are unimportant, when in fact they are very close in terms of their relationship with the response variable. Datadive: http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
Random Forest. Hyperparameters: max_features=0.5, max_samples=228, min_samples_leaf=5, n_estimators=40. Features used: 30. Mean accuracy: 97.34%. Top 10 features.
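The "Top 10 features" ranking behind this slide can be reproduced from the forest's Gini importances, as sketched below with the stated hyperparameters (random seed is an assumption, so the exact ordering may vary slightly).

```python
# Sketch: rank the 30 WDBC features by Gini importance from a fitted forest.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
rf = RandomForestClassifier(
    n_estimators=40, max_features=0.5, max_samples=228,
    min_samples_leaf=5, random_state=0,
).fit(X, y)

# feature_importances_ sums to 1 across all features.
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp.head(10))
```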
Removing low-importance features. Threshold: 0.005. Features used: 13. Mean accuracy: 97.34%. Top 10 features.
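The thresholding step above amounts to keeping only columns whose importance exceeds 0.005, e.g.:

```python
# Sketch: drop features with Gini importance <= 0.005. The slide keeps 13 of
# 30 this way; the exact count here depends on the (assumed) random seed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
rf = RandomForestClassifier(n_estimators=40, random_state=0).fit(X, y)

kept = X.columns[rf.feature_importances_ > 0.005]
print(f"{len(kept)} features kept:", list(kept))
```

A model refit on `X[kept]` can then be compared against the full 30-feature baseline, which is how the slide arrives at the same 97.34% mean accuracy with fewer columns.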
Remove potentially redundant variables one at a time. Result (accuracy after dropping each variable):
texture_worst: 0.9364035087719298
texture_mean: 0.9473684210526315
radius_se: 0.9407894736842105
area_se: 0.9473684210526315
radius_worst: 0.9407894736842105
area_worst: 0.9385964912280702
concavity_worst: 0.9451754385964912
concavity_mean: 0.9495614035087719
concave points_worst: 0.9429824561403509
concave points_mean: 0.9364035087719298
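The drop-one-at-a-time check above can be sketched as follows: for each candidate, drop just that column, refit, and compare held-out accuracy. The split and seed here are assumptions, so the numbers will not match the slide's exactly; the candidate list maps the slide's names onto sklearn's ("texture_worst" becomes "worst texture", and so on).

```python
# Sketch: measure accuracy after removing each redundancy candidate in turn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = ["worst texture", "mean texture", "radius error", "area error",
              "worst radius", "worst area", "worst concavity", "mean concavity",
              "worst concave points", "mean concave points"]

for col in candidates:
    rf = RandomForestClassifier(n_estimators=40, random_state=0)
    rf.fit(X_tr.drop(columns=col), y_tr)
    acc = rf.score(X_te.drop(columns=col), y_te)
    print(f"without {col:22s} acc={acc:.4f}")
```

If accuracy barely moves when a column is dropped, the information it carries is largely duplicated by its correlated partners, which is the deck's justification for removing it.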
Partial Dependence (of top 3 features)
Waterfall plot
Model Interpretation. Questions:
How confident are we in our predictions using a particular row of data?
For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
Which columns are the strongest predictors, and which can we ignore?
Which columns are effectively redundant with each other, for purposes of prediction?
How do predictions vary as we vary these columns?
Source: Analytics Vidhya
The OOB (out-of-bag) score prevents leakage and gives a low-variance estimate of model performance without holding out a separate validation split, which is not necessarily the case with cross-validation. It works best for small and medium-sized datasets.
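The OOB estimate described above comes for free with bagging: each tree is scored on the rows left out of its bootstrap sample, so no rows are both trained and evaluated on by the same tree. A minimal sketch:

```python
# Sketch: out-of-bag accuracy for a random forest on WDBC, no held-out split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True makes the forest record each tree's predictions on the
# samples excluded from that tree's bootstrap draw.
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=0).fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

Because every row is used for training (by most trees) and for evaluation (by the trees that did not see it), all 569 samples contribute to both fitting and scoring, which is why OOB is attractive on small and medium datasets.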