The Curious Case of the Wisconsin Diagnostic Dataset

SabinBhatta · 22 slides · Oct 02, 2024

About This Presentation

This project aimed to build a well-performing classification model for breast cancer diagnosis on the Wisconsin Diagnostic Breast Cancer dataset using a minimum number of features. This was investigated by finding the strongest predictors and their contributions to the prediction using deci...


Slide Content

The Curious Case of the Wisconsin Diagnostic Dataset (WDBC)

Starring

Dataset Overview
No missing/null attribute values. 32 attributes: ID (only one image per ID), the diagnosis (benign or malignant), and 30 real-valued input features.

Class distribution:
Diagnosis    Count   Share
Benign       357     62.7%
Malignant    212     37.3%
Total        569     100%
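As a quick sanity check, the dataset and its class balance can be loaded with scikit-learn's bundled copy of WDBC (an assumption; the slides do not say which loader was used — in this copy the target is encoded 0 = malignant, 1 = benign):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target            # 569 rows, 30 real-valued features

# Count each class and report its share of the dataset
counts = y.value_counts().rename(index=dict(enumerate(data.target_names)))
print(counts)                            # benign 357, malignant 212
print((counts / counts.sum() * 100).round(1))  # 62.7% / 37.3%
```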

Multicollinear Features (Pearson coefficient >= 0.85):
area_se, perimeter_mean, texture_worst, perimeter_worst, area_mean, concavity_mean, perimeter_se, radius_worst, concave_points_worst, area_worst, compactness_worst
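A minimal sketch of how such pairs can be flagged, reusing X from the snippet above; note that scikit-learn's column names ('worst perimeter', 'area error', ...) differ from the UCI-style names on the slide (perimeter_worst, area_se, ...):

```python
import numpy as np

# Absolute pairwise Pearson correlations between all 30 features
corr = X.corr(method="pearson").abs()

# Keep only the upper triangle so each pair is listed once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Feature pairs at or above the slide's 0.85 cutoff
print(pairs[pairs >= 0.85].sort_values(ascending=False))
```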

What if I told you that using only 8 of the 30 attributes would give you a decent result? The choice is yours!

Just Eight!
perimeter_worst, area_worst, concave points_mean, perimeter_mean, concavity_worst, texture_worst, compactness_worst, radius_se

The mean accuracy of a Random Forest using only these 8 features was 95.575%.
Hyperparameters: max_features=0.5, max_samples=228, min_samples_leaf=5, n_estimators=40. Split: 80:20.
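A sketch of that model under the stated hyperparameters, continuing from the snippets above. The column names map the slide's UCI-style names onto scikit-learn's, and averaging over ten random 80:20 splits is an assumption about how the quoted mean accuracy was obtained:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The slide's eight features, in scikit-learn's naming
eight = ["worst perimeter", "worst area", "mean concave points",
         "mean perimeter", "worst concavity", "worst texture",
         "worst compactness", "radius error"]

scores = []
for seed in range(10):                      # repeat over random 80:20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[eight], y, test_size=0.2, stratify=y, random_state=seed)
    rf = RandomForestClassifier(n_estimators=40, max_features=0.5,
                                max_samples=228, min_samples_leaf=5,
                                random_state=seed)
    scores.append(rf.fit(X_tr, y_tr).score(X_te, y_te))

print(f"mean accuracy: {np.mean(scores):.4f}")   # ~0.95 on these 8 features
```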

Removing correlated features is useful when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when interpreting the data, this can lead to the incorrect conclusion that one of the variables in a correlated group is a strong predictor while the others are unimportant, when in fact they are very close in terms of their relationship with the response variable. (Datadive: http://blog.datadive.net/selecting-good-features-part-iii-random-forests/)

Random Forest (all 30 features)
Hyperparameters: max_features=0.5, max_samples=228, min_samples_leaf=5, n_estimators=40
Number of features used: 30
Mean accuracy: 97.34%
[Chart: top 10 feature importances]
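The full 30-feature model and its top-10 impurity-based importances (presumably what the "top 10 features" chart showed) can be sketched like this, again reusing X and y from above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
rf = RandomForestClassifier(n_estimators=40, max_features=0.5,
                            max_samples=228, min_samples_leaf=5,
                            random_state=42).fit(X_tr, y_tr)

# Rank features by impurity-based importance and keep the top 10
top10 = (pd.Series(rf.feature_importances_, index=X.columns)
           .sort_values(ascending=False).head(10))
print(top10)
print(f"test accuracy: {rf.score(X_te, y_te):.4f}")
```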

Removing low-importance features
Threshold: 0.005
Number of features used: 13
Mean accuracy: 97.34%
[Chart: top 10 feature importances]
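A sketch of the thresholding step using SelectFromModel on the fitted forest above (assuming a recent scikit-learn; the exact number of surviving features may differ slightly from the slide's 13 depending on the seed):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance meets the 0.005 threshold
selector = SelectFromModel(rf, threshold=0.005, prefit=True)
kept = X.columns[selector.get_support()]
print(len(kept), "features kept:", list(kept))

# Refit on the reduced feature set and check that accuracy holds up
rf_small = RandomForestClassifier(n_estimators=40, max_features=0.5,
                                  max_samples=228, min_samples_leaf=5,
                                  random_state=42).fit(X_tr[kept], y_tr)
print(f"test accuracy: {rf_small.score(X_te[kept], y_te):.4f}")
```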

Removing redundant features
OOB score: 0.9342 (baseline)
Candidate pairs:
texture_worst, texture_mean
radius_se, area_se
radius_worst, area_worst
concavity_worst, concavity_mean
concave points_worst, concave points_mean

Remove potentially redundant variables one at a time. Resulting OOB scores:
texture_worst: 0.9364035087719298
texture_mean: 0.9473684210526315
radius_se: 0.9407894736842105
area_se: 0.9473684210526315
radius_worst: 0.9407894736842105
area_worst: 0.9385964912280702
concavity_worst: 0.9451754385964912
concavity_mean: 0.9495614035087719
concave points_worst: 0.9429824561403509
concave points_mean: 0.9364035087719298
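A sketch of that procedure: fit with oob_score=True for a baseline, then drop each candidate column in turn and refit. The hyperparameters are carried over from the earlier slides, and the exact scores will vary with the random seed:

```python
from sklearn.ensemble import RandomForestClassifier

def oob(df):
    # Out-of-bag accuracy of a forest fit on the given feature subset
    rf = RandomForestClassifier(n_estimators=40, max_features=0.5,
                                min_samples_leaf=5, oob_score=True,
                                random_state=42).fit(df, y)
    return rf.oob_score_

print("baseline:", oob(X))                       # slide reports 0.9342

candidates = ["worst texture", "mean texture", "radius error", "area error",
              "worst radius", "worst area", "worst concavity",
              "mean concavity", "worst concave points", "mean concave points"]
for col in candidates:
    # Drop one candidate at a time; if OOB accuracy holds, it was redundant
    print(col, round(oob(X.drop(columns=[col])), 4))
```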

Partial Dependence (of top 3 features)
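These panels can be reproduced with scikit-learn's inspection module on the fitted forest, taking the three most important features from the ranking computed above:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the predicted class on each of the top 3 features
top3 = list(top10.index[:3])
PartialDependenceDisplay.from_estimator(rf, X_tr, features=top3)
plt.show()
```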

Waterfall plot
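The slide does not say which package produced the waterfall plot; a sketch with the shap library (an assumption) would look like this. For a scikit-learn classifier the explanation carries one slice per class, so a single test row and class are selected:

```python
import shap

explainer = shap.TreeExplainer(rf)     # explainer for tree ensembles
sv = explainer(X_te)                   # Explanation: (rows, features, classes)

# Waterfall for the first test row, class 1 (benign in sklearn's encoding)
shap.plots.waterfall(sv[0, :, 1])
```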

Model Interpretation: Questions
How confident are we in our predictions using a particular row of data?
For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
Which columns are the strongest predictors, and which can we ignore?
Which columns are effectively redundant with each other for purposes of prediction?
How do predictions vary as we vary these columns?

Source: Analytics Vidhya

The OOB (out-of-bag) score prevents leakage and gives a low-variance estimate of model performance, which is not necessarily the case with cross-validation. It also works well for small and medium-sized datasets.

Thank You!